In [ ]:
1. Introduction
A brief description of your dataset: This data set seeks to find correlation between
This database presents measurements from specific production lines operating in an electrical cable manufacturing plant. The data were captured over four consecutive weeks, across all three daily shifts. The collected measurements include data on the human element (operator of the machine or machines on the line), environmental conditions, training, safety variables, and reports of unsafe conditions noted by the line supervisor during the shift. This database was compiled to understand the occurrence of workplace accidents and the underlying processes that provoke them. Include these variables
In [ ]:
context/domain/industry- Electrical Manufacturing Plant
In [147]:
# Checking lengths of each list in the data dictionary
data = {
'Variable Name': [
'ID_LINE', 'AGE_OPERATOR', 'YEARS_EXP', 'SENIORITY', 'EMPLOYEE_CAT', 'HOURS_OFTRAINING_SECURITY',
'HOURS_OFTRAINING_POSITION', 'GRADE_TEOREXAM', 'GRADE_PRACTICALEXAM', 'NUMBER_ILLS', 'SCORE_RISKOFMACH',
'SCORE_ILLUM', 'NOISE_ATPLACE', 'NUMBER_EXTRAHOURS', 'NUMBER_RESTHOURS', 'SCORE_HIDRAT', 'USE_PPE',
'USE_ADEQTOOLS', 'SUFFER_ANXIETY', 'EXPOSED_QUIM', 'SCORE_ILLUM', 'AVAILABLE_SPACE', 'SCORE_FATIGUE',
'EVAL_TIMEAVAIL', 'EVAL_KNOWSUFFIC', 'TEMP_PLACEOFWORK', 'ACA'
],
'Measurement Type': [
'Nominal', 'Ratio', 'Ratio', 'Ratio', 'Ordinal', 'Ratio', 'Ratio', 'Scale', 'Scale', 'Ratio', 'Scale', 'Scale',
'Scale', 'Ratio', 'Ratio', 'Ratio', 'Scale', 'Binary', 'Binary', 'Binary', 'Scale', 'Scale', 'Scale', 'Scale', 'Scale',
'Ratio', 'Nominal'
],
'Role': [
'Excluded', 'Predictor', 'Predictor', 'Predictor', 'Predictor', 'Predictor', 'Predictor', 'Predictor',
'Predictor', 'Predictor', 'Predictor', 'Predictor', 'Predictor', 'Predictor', 'Predictor', 'Predictor',
'Outcome', 'Outcome', 'Outcome', 'Predictor', 'Predictor', 'Predictor', 'Predictor', 'Predictor','Predictor', 'Predictor',
'Outcome'
],
'Industry': [
'Manufacturing', 'Manufacturing', 'Manufacturing', 'Manufacturing', 'Manufacturing', 'Manufacturing',
'Manufacturing', 'Manufacturing', 'Manufacturing', 'Manufacturing', 'Manufacturing', 'Manufacturing',
'Manufacturing', 'Manufacturing', 'Manufacturing', 'Manufacturing', 'Manufacturing', 'Manufacturing',
'Manufacturing', 'Manufacturing', 'Manufacturing', 'Manufacturing', 'Manufacturing', 'Manufacturing',
'Manufacturing', 'Manufacturing', 'Manufacturing'
]
}
# Creating the DataFrame
df = pd.DataFrame(data)
# Displaying the DataFrame
print(df)
Variable Name Measurement Type Role Industry 0 ID_LINE Nominal Excluded Manufacturing 1 AGE_OPERATOR Ratio Predictor Manufacturing 2 YEARS_EXP Ratio Predictor Manufacturing 3 SENIORITY Ratio Predictor Manufacturing 4 EMPLOYEE_CAT Ordinal Predictor Manufacturing 5 HOURS_OFTRAINING_SECURITY Ratio Predictor Manufacturing 6 HOURS_OFTRAINING_POSITION Ratio Predictor Manufacturing 7 GRADE_TEOREXAM Scale Predictor Manufacturing 8 GRADE_PRACTICALEXAM Scale Predictor Manufacturing 9 NUMBER_ILLS Ratio Predictor Manufacturing 10 SCORE_RISKOFMACH Scale Predictor Manufacturing 11 SCORE_ILLUM Scale Predictor Manufacturing 12 NOISE_ATPLACE Scale Predictor Manufacturing 13 NUMBER_EXTRAHOURS Ratio Predictor Manufacturing 14 NUMBER_RESTHOURS Ratio Predictor Manufacturing 15 SCORE_HIDRAT Ratio Predictor Manufacturing 16 USE_PPE Scale Outcome Manufacturing 17 USE_ADEQTOOLS Binary Outcome Manufacturing 18 SUFFER_ANXIETY Binary Outcome Manufacturing 19 EXPOSED_QUIM Binary Predictor Manufacturing 20 SCORE_ILLUM Scale Predictor Manufacturing 21 AVAILABLE_SPACE Scale Predictor Manufacturing 22 SCORE_FATIGUE Scale Predictor Manufacturing 23 EVAL_TIMEAVAIL Scale Predictor Manufacturing 24 EVAL_KNOWSUFFIC Scale Predictor Manufacturing 25 TEMP_PLACEOFWORK Ratio Predictor Manufacturing 26 ACA Nominal Outcome Manufacturing
In [1]:
!pip install pandas openpyxl
Requirement already satisfied: pandas in c:\users\wsher\anaconda3\lib\site-packages (2.2.2) Requirement already satisfied: openpyxl in c:\users\wsher\anaconda3\lib\site-packages (3.1.5) Requirement already satisfied: numpy>=1.26.0 in c:\users\wsher\anaconda3\lib\site-packages (from pandas) (1.26.4) Requirement already satisfied: python-dateutil>=2.8.2 in c:\users\wsher\anaconda3\lib\site-packages (from pandas) (2.9.0.post0) Requirement already satisfied: pytz>=2020.1 in c:\users\wsher\anaconda3\lib\site-packages (from pandas) (2024.1) Requirement already satisfied: tzdata>=2022.7 in c:\users\wsher\anaconda3\lib\site-packages (from pandas) (2023.3) Requirement already satisfied: et-xmlfile in c:\users\wsher\anaconda3\lib\site-packages (from openpyxl) (1.1.0) Requirement already satisfied: six>=1.5 in c:\users\wsher\anaconda3\lib\site-packages (from python-dateutil>=2.8.2->pandas) (1.16.0)
In [2]:
!pip install seaborn
Requirement already satisfied: seaborn in c:\users\wsher\anaconda3\lib\site-packages (0.13.2) Requirement already satisfied: numpy!=1.24.0,>=1.20 in c:\users\wsher\anaconda3\lib\site-packages (from seaborn) (1.26.4) Requirement already satisfied: pandas>=1.2 in c:\users\wsher\anaconda3\lib\site-packages (from seaborn) (2.2.2) Requirement already satisfied: matplotlib!=3.6.1,>=3.4 in c:\users\wsher\anaconda3\lib\site-packages (from seaborn) (3.9.2) Requirement already satisfied: contourpy>=1.0.1 in c:\users\wsher\anaconda3\lib\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (1.2.0) Requirement already satisfied: cycler>=0.10 in c:\users\wsher\anaconda3\lib\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (0.11.0) Requirement already satisfied: fonttools>=4.22.0 in c:\users\wsher\anaconda3\lib\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (4.51.0) Requirement already satisfied: kiwisolver>=1.3.1 in c:\users\wsher\anaconda3\lib\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (1.4.4) Requirement already satisfied: packaging>=20.0 in c:\users\wsher\anaconda3\lib\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (24.1) Requirement already satisfied: pillow>=8 in c:\users\wsher\anaconda3\lib\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (10.4.0) Requirement already satisfied: pyparsing>=2.3.1 in c:\users\wsher\anaconda3\lib\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (3.1.2) Requirement already satisfied: python-dateutil>=2.7 in c:\users\wsher\anaconda3\lib\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (2.9.0.post0) Requirement already satisfied: pytz>=2020.1 in c:\users\wsher\anaconda3\lib\site-packages (from pandas>=1.2->seaborn) (2024.1) Requirement already satisfied: tzdata>=2022.7 in c:\users\wsher\anaconda3\lib\site-packages (from pandas>=1.2->seaborn) (2023.3) Requirement already satisfied: six>=1.5 in c:\users\wsher\anaconda3\lib\site-packages (from python-dateutil>=2.7->matplotlib!=3.6.1,>=3.4->seaborn) (1.16.0)
In [3]:
! pip install ISLP
Requirement already satisfied: ISLP in c:\users\wsher\anaconda3\lib\site-packages (0.4.0) Requirement already satisfied: numpy>=1.7.1 in c:\users\wsher\anaconda3\lib\site-packages (from ISLP) (1.26.4) Requirement already satisfied: scipy>=0.9 in c:\users\wsher\anaconda3\lib\site-packages (from ISLP) (1.11.4) Requirement already satisfied: pandas>=0.20 in c:\users\wsher\anaconda3\lib\site-packages (from ISLP) (2.2.2) Requirement already satisfied: lxml in c:\users\wsher\anaconda3\lib\site-packages (from ISLP) (5.2.1) Requirement already satisfied: scikit-learn>=1.2 in c:\users\wsher\anaconda3\lib\site-packages (from ISLP) (1.5.1) Requirement already satisfied: joblib in c:\users\wsher\anaconda3\lib\site-packages (from ISLP) (1.4.2) Requirement already satisfied: statsmodels>=0.13 in c:\users\wsher\anaconda3\lib\site-packages (from ISLP) (0.14.2) Requirement already satisfied: lifelines in c:\users\wsher\anaconda3\lib\site-packages (from ISLP) (0.30.0) Requirement already satisfied: pygam in c:\users\wsher\anaconda3\lib\site-packages (from ISLP) (0.9.1) Requirement already satisfied: torch in c:\users\wsher\anaconda3\lib\site-packages (from ISLP) (2.6.0) Requirement already satisfied: pytorch-lightning in c:\users\wsher\anaconda3\lib\site-packages (from ISLP) (2.5.0.post0) Requirement already satisfied: torchmetrics in c:\users\wsher\anaconda3\lib\site-packages (from ISLP) (1.6.2) Requirement already satisfied: python-dateutil>=2.8.2 in c:\users\wsher\anaconda3\lib\site-packages (from pandas>=0.20->ISLP) (2.9.0.post0) Requirement already satisfied: pytz>=2020.1 in c:\users\wsher\anaconda3\lib\site-packages (from pandas>=0.20->ISLP) (2024.1) Requirement already satisfied: tzdata>=2022.7 in c:\users\wsher\anaconda3\lib\site-packages (from pandas>=0.20->ISLP) (2023.3) Requirement already satisfied: threadpoolctl>=3.1.0 in c:\users\wsher\anaconda3\lib\site-packages (from scikit-learn>=1.2->ISLP) (3.5.0) Requirement already satisfied: patsy>=0.5.6 in c:\users\wsher\anaconda3\lib\site-packages (from statsmodels>=0.13->ISLP) (0.5.6) Requirement already satisfied: packaging>=21.3 in c:\users\wsher\anaconda3\lib\site-packages (from statsmodels>=0.13->ISLP) (24.1) Requirement already satisfied: matplotlib>=3.0 in c:\users\wsher\anaconda3\lib\site-packages (from lifelines->ISLP) (3.9.2) Requirement already satisfied: autograd>=1.5 in c:\users\wsher\anaconda3\lib\site-packages (from lifelines->ISLP) (1.7.0) Requirement already satisfied: autograd-gamma>=0.3 in c:\users\wsher\anaconda3\lib\site-packages (from lifelines->ISLP) (0.5.0) Requirement already satisfied: formulaic>=0.2.2 in c:\users\wsher\anaconda3\lib\site-packages (from lifelines->ISLP) (1.1.1) Requirement already satisfied: progressbar2<5.0.0,>=4.2.0 in c:\users\wsher\anaconda3\lib\site-packages (from pygam->ISLP) (4.5.0) Requirement already satisfied: tqdm>=4.57.0 in c:\users\wsher\anaconda3\lib\site-packages (from pytorch-lightning->ISLP) (4.66.5) Requirement already satisfied: PyYAML>=5.4 in c:\users\wsher\anaconda3\lib\site-packages (from pytorch-lightning->ISLP) (6.0.1) Requirement already satisfied: fsspec>=2022.5.0 in c:\users\wsher\anaconda3\lib\site-packages (from fsspec[http]>=2022.5.0->pytorch-lightning->ISLP) (2024.6.1) Requirement already satisfied: typing-extensions>=4.4.0 in c:\users\wsher\anaconda3\lib\site-packages (from pytorch-lightning->ISLP) (4.11.0) Requirement already satisfied: lightning-utilities>=0.10.0 in c:\users\wsher\anaconda3\lib\site-packages (from pytorch-lightning->ISLP) (0.14.0) Requirement already satisfied: filelock in c:\users\wsher\anaconda3\lib\site-packages (from torch->ISLP) (3.13.1) Requirement already satisfied: networkx in c:\users\wsher\anaconda3\lib\site-packages (from torch->ISLP) (3.3) Requirement already satisfied: jinja2 in c:\users\wsher\anaconda3\lib\site-packages (from torch->ISLP) (3.1.4) Requirement already satisfied: setuptools in c:\users\wsher\anaconda3\lib\site-packages (from torch->ISLP) (75.1.0) Requirement already satisfied: sympy==1.13.1 in c:\users\wsher\anaconda3\lib\site-packages (from torch->ISLP) (1.13.1) Requirement already satisfied: mpmath<1.4,>=1.1.0 in c:\users\wsher\anaconda3\lib\site-packages (from sympy==1.13.1->torch->ISLP) (1.3.0) Requirement already satisfied: interface-meta>=1.2.0 in c:\users\wsher\anaconda3\lib\site-packages (from formulaic>=0.2.2->lifelines->ISLP) (1.3.0) Requirement already satisfied: wrapt>=1.0 in c:\users\wsher\anaconda3\lib\site-packages (from formulaic>=0.2.2->lifelines->ISLP) (1.14.1) Requirement already satisfied: aiohttp!=4.0.0a0,!=4.0.0a1 in c:\users\wsher\anaconda3\lib\site-packages (from fsspec[http]>=2022.5.0->pytorch-lightning->ISLP) (3.10.5) Requirement already satisfied: contourpy>=1.0.1 in c:\users\wsher\anaconda3\lib\site-packages (from matplotlib>=3.0->lifelines->ISLP) (1.2.0) Requirement already satisfied: cycler>=0.10 in c:\users\wsher\anaconda3\lib\site-packages (from matplotlib>=3.0->lifelines->ISLP) (0.11.0) Requirement already satisfied: fonttools>=4.22.0 in c:\users\wsher\anaconda3\lib\site-packages (from matplotlib>=3.0->lifelines->ISLP) (4.51.0) Requirement already satisfied: kiwisolver>=1.3.1 in c:\users\wsher\anaconda3\lib\site-packages (from matplotlib>=3.0->lifelines->ISLP) (1.4.4) Requirement already satisfied: pillow>=8 in c:\users\wsher\anaconda3\lib\site-packages (from matplotlib>=3.0->lifelines->ISLP) (10.4.0) Requirement already satisfied: pyparsing>=2.3.1 in c:\users\wsher\anaconda3\lib\site-packages (from matplotlib>=3.0->lifelines->ISLP) (3.1.2) Requirement already satisfied: six in c:\users\wsher\anaconda3\lib\site-packages (from patsy>=0.5.6->statsmodels>=0.13->ISLP) (1.16.0) Requirement already satisfied: python-utils>=3.8.1 in c:\users\wsher\anaconda3\lib\site-packages (from progressbar2<5.0.0,>=4.2.0->pygam->ISLP) (3.9.1) Requirement already satisfied: colorama in c:\users\wsher\anaconda3\lib\site-packages (from tqdm>=4.57.0->pytorch-lightning->ISLP) (0.4.6) Requirement already satisfied: MarkupSafe>=2.0 in c:\users\wsher\anaconda3\lib\site-packages (from jinja2->torch->ISLP) (2.1.3) Requirement already satisfied: aiohappyeyeballs>=2.3.0 in c:\users\wsher\anaconda3\lib\site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]>=2022.5.0->pytorch-lightning->ISLP) (2.4.0) Requirement already satisfied: aiosignal>=1.1.2 in c:\users\wsher\anaconda3\lib\site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]>=2022.5.0->pytorch-lightning->ISLP) (1.2.0) Requirement already satisfied: attrs>=17.3.0 in c:\users\wsher\anaconda3\lib\site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]>=2022.5.0->pytorch-lightning->ISLP) (23.1.0) Requirement already satisfied: frozenlist>=1.1.1 in c:\users\wsher\anaconda3\lib\site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]>=2022.5.0->pytorch-lightning->ISLP) (1.4.0) Requirement already satisfied: multidict<7.0,>=4.5 in c:\users\wsher\anaconda3\lib\site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]>=2022.5.0->pytorch-lightning->ISLP) (6.0.4) Requirement already satisfied: yarl<2.0,>=1.0 in c:\users\wsher\anaconda3\lib\site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]>=2022.5.0->pytorch-lightning->ISLP) (1.11.0) Requirement already satisfied: idna>=2.0 in c:\users\wsher\anaconda3\lib\site-packages (from yarl<2.0,>=1.0->aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]>=2022.5.0->pytorch-lightning->ISLP) (3.7)
In [4]:
!pip install scipy
Requirement already satisfied: scipy in c:\users\wsher\anaconda3\lib\site-packages (1.11.4) Requirement already satisfied: numpy<1.28.0,>=1.21.6 in c:\users\wsher\anaconda3\lib\site-packages (from scipy) (1.26.4)
In [5]:
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as plt
from matplotlib.pyplot import subplots
import seaborn as sns
# Helps on better visalization of graphs
sns.set()
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# These code will bring statmodels library and dependences
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor as VIF
from statsmodels.stats.anova import anova_lm
from ISLP import load_data
from ISLP.models import (ModelSpec as MS,summarize,poly)
In [4]:
import pandas as pd
# Path to the file in the Downloads folder (adjust the username as needed)
file_path = 'C:/Users/wsher/Downloads/PLANT_SECURITY_SV.xlsx'
# Read the data from the specified sheet
database = pd.read_excel(file_path, sheet_name='DB')
In [5]:
database.head()
Out[5]:
| ID_LINE | AGE_OPERATOR | YEARS_EXP | SENIORITY | EMPLOYEE_CAT | HOURS_OFTRAINING_SECURITY | HOURS_OFTRAINING_POSITION | GRADE_TEOREXAM | GRADE_PRACTICALEXAM | NUMBER_ILLS | ... | USE_ADEQTOOLS | SUFFER?ANXIETY | EXPOSED_QUIM | SCORE_ILLUM.1 | AVAILABLE_SPACE | SCORE_FATIGUE | EVAL_TIMEAVAIL | EVAL_KNOWSUFFIC | TEMP_PLACEOFWORK | ACA | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 189399M851 | 36 | 6 | 6 | 6 | 14.5 | 38 | 100 | 100 | 3 | ... | 1 | 1 | 1 | 4 | 4 | 3 | 4 | 4 | 36.7 | 0 |
| 1 | 2133265M301 | 19 | 1 | 1 | 6 | 14.5 | 22 | 95 | 95 | 1 | ... | 1 | 0 | 1 | 4 | 4 | 2 | 4 | 4 | 36.3 | 0 |
| 2 | 32695VZF81 | 39 | 10 | 21 | 6 | 14.5 | 38 | 100 | 100 | 2 | ... | 1 | 0 | 4 | 4 | 3 | 4 | 4 | 5 | 36.3 | 0 |
| 3 | 4147823VZ81 | 22 | 1 | 1 | 7 | 14.5 | 10 | 90 | 90 | 2 | ... | 1 | 0 | 3 | 5 | 5 | 5 | 5 | 5 | 36.3 | 4 |
| 4 | 5106984MZV7/1 1 | 26 | 1 | 4 | 6 | 14.5 | 38 | 100 | 100 | 1 | ... | 1 | 0 | 2 | 4 | 4 | 2 | 4 | 4 | 36.3 | 0 |
5 rows × 27 columns
In [6]:
database.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 865 entries, 0 to 864 Data columns (total 27 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID_LINE 865 non-null object 1 AGE_OPERATOR 865 non-null int64 2 YEARS_EXP 865 non-null int64 3 SENIORITY 865 non-null int64 4 EMPLOYEE_CAT 865 non-null int64 5 HOURS_OFTRAINING_SECURITY 865 non-null float64 6 HOURS_OFTRAINING_POSITION 865 non-null int64 7 GRADE_TEOREXAM 865 non-null int64 8 GRADE_PRACTICALEXAM 865 non-null int64 9 NUMBER_ILLS 865 non-null int64 10 SCORE_RISKOFMACH 865 non-null float64 11 SCORE_ILLUM 865 non-null float64 12 NOISE_ATPLACE 865 non-null float64 13 NUMBER_EXTRAHOURS 865 non-null float64 14 NUMBER_RESTHOURS 865 non-null float64 15 SCORE_HIDRAT 865 non-null int64 16 USE_PPE 865 non-null int64 17 USE_ADEQTOOLS 865 non-null int64 18 SUFFER?ANXIETY 865 non-null int64 19 EXPOSED_QUIM 865 non-null int64 20 SCORE_ILLUM.1 865 non-null int64 21 AVAILABLE_SPACE 865 non-null int64 22 SCORE_FATIGUE 865 non-null int64 23 EVAL_TIMEAVAIL 865 non-null int64 24 EVAL_KNOWSUFFIC 865 non-null int64 25 TEMP_PLACEOFWORK 865 non-null float64 26 ACA 865 non-null int64 dtypes: float64(7), int64(19), object(1) memory usage: 182.6+ KB
In [7]:
# checking shape of the data
print("There are", database.shape[0], 'rows and', database.shape[1], "columns.")
There are 865 rows and 27 columns.
In [8]:
# List of columns to include in the new DataFrame 'db'
columns_to_include = [
'AGE_OPERATOR', 'YEARS_EXP', 'SENIORITY', 'EMPLOYEE_CAT',
'HOURS_OFTRAINING_SECURITY', 'HOURS_OFTRAINING_POSITION',
'GRADE_TEOREXAM', 'GRADE_PRACTICALEXAM', 'NUMBER_ILLS',
'SCORE_RISKOFMACH', 'SCORE_ILLUM', 'NOISE_ATPLACE',
'NUMBER_EXTRAHOURS', 'NUMBER_RESTHOURS', 'SCORE_HIDRAT',
'USE_PPE', 'USE_ADEQTOOLS', 'SUFFER?ANXIETY', 'EXPOSED_QUIM',
'AVAILABLE_SPACE', 'SCORE_FATIGUE', 'EVAL_TIMEAVAIL',
'EVAL_KNOWSUFFIC', 'TEMP_PLACEOFWORK', 'ACA'
]
# Create the new DataFrame 'db' with only the selected columns
data = database[columns_to_include].copy()
In [9]:
data.head()
Out[9]:
| AGE_OPERATOR | YEARS_EXP | SENIORITY | EMPLOYEE_CAT | HOURS_OFTRAINING_SECURITY | HOURS_OFTRAINING_POSITION | GRADE_TEOREXAM | GRADE_PRACTICALEXAM | NUMBER_ILLS | SCORE_RISKOFMACH | ... | USE_PPE | USE_ADEQTOOLS | SUFFER?ANXIETY | EXPOSED_QUIM | AVAILABLE_SPACE | SCORE_FATIGUE | EVAL_TIMEAVAIL | EVAL_KNOWSUFFIC | TEMP_PLACEOFWORK | ACA | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 36 | 6 | 6 | 6 | 14.5 | 38 | 100 | 100 | 3 | 789.0 | ... | 1 | 1 | 1 | 1 | 4 | 3 | 4 | 4 | 36.7 | 0 |
| 1 | 19 | 1 | 1 | 6 | 14.5 | 22 | 95 | 95 | 1 | 789.0 | ... | 1 | 1 | 0 | 1 | 4 | 2 | 4 | 4 | 36.3 | 0 |
| 2 | 39 | 10 | 21 | 6 | 14.5 | 38 | 100 | 100 | 2 | 868.0 | ... | 1 | 1 | 0 | 4 | 3 | 4 | 4 | 5 | 36.3 | 0 |
| 3 | 22 | 1 | 1 | 7 | 14.5 | 10 | 90 | 90 | 2 | 868.0 | ... | 1 | 1 | 0 | 3 | 5 | 5 | 5 | 5 | 36.3 | 4 |
| 4 | 26 | 1 | 4 | 6 | 14.5 | 38 | 100 | 100 | 1 | 1072.0 | ... | 1 | 1 | 0 | 2 | 4 | 2 | 4 | 4 | 36.3 | 0 |
5 rows × 25 columns
In [10]:
data.tail()
Out[10]:
| AGE_OPERATOR | YEARS_EXP | SENIORITY | EMPLOYEE_CAT | HOURS_OFTRAINING_SECURITY | HOURS_OFTRAINING_POSITION | GRADE_TEOREXAM | GRADE_PRACTICALEXAM | NUMBER_ILLS | SCORE_RISKOFMACH | ... | USE_PPE | USE_ADEQTOOLS | SUFFER?ANXIETY | EXPOSED_QUIM | AVAILABLE_SPACE | SCORE_FATIGUE | EVAL_TIMEAVAIL | EVAL_KNOWSUFFIC | TEMP_PLACEOFWORK | ACA | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 860 | 26 | 1 | 4 | 6 | 14.5 | 38 | 100 | 100 | 1 | 1072.0 | ... | 1 | 1 | 0 | 2 | 4 | 2 | 4 | 4 | 35.5 | 0 |
| 861 | 28 | 1 | 1 | 2 | 14.5 | 20 | 100 | 100 | 0 | 623.8 | ... | 1 | 1 | 0 | 3 | 4 | 2 | 4 | 4 | 35.3 | 0 |
| 862 | 55 | 10 | 15 | 6 | 14.5 | 38 | 90 | 90 | 1 | 623.9 | ... | 1 | 1 | 0 | 2 | 5 | 2 | 2 | 2 | 35.3 | 0 |
| 863 | 24 | 1 | 1 | 4 | 14.5 | 22 | 100 | 100 | 1 | 1072.0 | ... | 1 | 1 | 0 | 1 | 5 | 1 | 5 | 4 | 35.4 | 0 |
| 864 | 36 | 1 | 6 | 6 | 14.5 | 38 | 100 | 100 | 2 | 1072.0 | ... | 1 | 1 | 0 | 3 | 5 | 2 | 5 | 5 | 35.4 | 0 |
5 rows × 25 columns
In [11]:
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 865 entries, 0 to 864 Data columns (total 25 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 AGE_OPERATOR 865 non-null int64 1 YEARS_EXP 865 non-null int64 2 SENIORITY 865 non-null int64 3 EMPLOYEE_CAT 865 non-null int64 4 HOURS_OFTRAINING_SECURITY 865 non-null float64 5 HOURS_OFTRAINING_POSITION 865 non-null int64 6 GRADE_TEOREXAM 865 non-null int64 7 GRADE_PRACTICALEXAM 865 non-null int64 8 NUMBER_ILLS 865 non-null int64 9 SCORE_RISKOFMACH 865 non-null float64 10 SCORE_ILLUM 865 non-null float64 11 NOISE_ATPLACE 865 non-null float64 12 NUMBER_EXTRAHOURS 865 non-null float64 13 NUMBER_RESTHOURS 865 non-null float64 14 SCORE_HIDRAT 865 non-null int64 15 USE_PPE 865 non-null int64 16 USE_ADEQTOOLS 865 non-null int64 17 SUFFER?ANXIETY 865 non-null int64 18 EXPOSED_QUIM 865 non-null int64 19 AVAILABLE_SPACE 865 non-null int64 20 SCORE_FATIGUE 865 non-null int64 21 EVAL_TIMEAVAIL 865 non-null int64 22 EVAL_KNOWSUFFIC 865 non-null int64 23 TEMP_PLACEOFWORK 865 non-null float64 24 ACA 865 non-null int64 dtypes: float64(7), int64(18) memory usage: 169.1 KB
In [29]:
# checking for missing values in the data
data.isnull().sum()
Out[29]:
ID_LINE 0 AGE_OPERATOR 0 YEARS_EXP 0 SENIORITY 0 EMPLOYEE_CAT 0 HOURS_OFTRAINING_SECURITY 0 HOURS_OFTRAINING_POSITION 0 GRADE_TEOREXAM 0 GRADE_PRACTICALEXAM 0 NUMBER_ILLS 0 SCORE_RISKOFMACH 0 SCORE_ILLUM 0 NOISE_ATPLACE 0 NUMBER_EXTRAHOURS 0 NUMBER_RESTHOURS 0 SCORE_HIDRAT 0 USE_PPE 0 USE_ADEQTOOLS 0 SUFFER?ANXIETY 0 EXPOSED_QUIM 0 SCORE_ILLUM.1 0 AVAILABLE_SPACE 0 SCORE_FATIGUE 0 EVAL_TIMEAVAIL 0 EVAL_KNOWSUFFIC 0 TEMP_PLACEOFWORK 0 ACA 0 dtype: int64
In [30]:
# Let's look at the statistical summary of the data
data.describe().T
Out[30]:
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| AGE_OPERATOR | 865.0 | 35.958382 | 9.693241 | 19.0 | 28.0 | 34.0 | 44.0 | 57.0 |
| YEARS_EXP | 865.0 | 3.579191 | 3.100333 | 1.0 | 1.0 | 2.0 | 6.0 | 10.0 |
| SENIORITY | 865.0 | 7.972254 | 6.562320 | 1.0 | 2.0 | 7.0 | 15.0 | 21.0 |
| EMPLOYEE_CAT | 865.0 | 5.395376 | 1.248509 | 2.0 | 5.0 | 6.0 | 6.0 | 7.0 |
| HOURS_OFTRAINING_SECURITY | 865.0 | 14.500000 | 0.000000 | 14.5 | 14.5 | 14.5 | 14.5 | 14.5 |
| HOURS_OFTRAINING_POSITION | 865.0 | 33.535260 | 7.859382 | 10.0 | 32.0 | 38.0 | 38.0 | 38.0 |
| GRADE_TEOREXAM | 865.0 | 97.115607 | 5.020610 | 80.0 | 95.0 | 100.0 | 100.0 | 100.0 |
| GRADE_PRACTICALEXAM | 865.0 | 97.358382 | 4.911120 | 80.0 | 95.0 | 100.0 | 100.0 | 100.0 |
| NUMBER_ILLS | 865.0 | 1.658960 | 1.154898 | 0.0 | 1.0 | 2.0 | 3.0 | 4.0 |
| SCORE_RISKOFMACH | 865.0 | 821.669503 | 170.550540 | 623.1 | 623.8 | 789.0 | 868.0 | 1072.0 |
| SCORE_ILLUM | 865.0 | 93.538289 | 49.781323 | 13.0 | 40.0 | 114.0 | 121.0 | 164.0 |
| NOISE_ATPLACE | 865.0 | 84.492520 | 1.325648 | 82.0 | 83.8 | 84.0 | 85.0 | 88.0 |
| NUMBER_EXTRAHOURS | 865.0 | 0.524855 | 2.134856 | 0.0 | 0.0 | 0.0 | 0.0 | 15.3 |
| NUMBER_RESTHOURS | 865.0 | 0.141965 | 1.004473 | 0.0 | 0.0 | 0.0 | 0.0 | 8.1 |
| SCORE_HIDRAT | 865.0 | 4.954913 | 2.462506 | 0.0 | 3.0 | 5.0 | 6.0 | 15.0 |
| USE_PPE | 865.0 | 0.876301 | 0.329429 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| USE_ADEQTOOLS | 865.0 | 0.708671 | 0.454638 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| SUFFER?ANXIETY | 865.0 | 0.223121 | 0.416580 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| EXPOSED_QUIM | 865.0 | 2.395376 | 1.197403 | 1.0 | 1.0 | 2.0 | 3.0 | 5.0 |
| SCORE_ILLUM.1 | 865.0 | 4.223121 | 0.745308 | 2.0 | 4.0 | 4.0 | 5.0 | 5.0 |
| AVAILABLE_SPACE | 865.0 | 4.083237 | 0.868699 | 1.0 | 4.0 | 4.0 | 5.0 | 5.0 |
| SCORE_FATIGUE | 865.0 | 2.690173 | 1.300359 | 1.0 | 2.0 | 2.0 | 4.0 | 5.0 |
| EVAL_TIMEAVAIL | 865.0 | 3.937572 | 0.823996 | 1.0 | 4.0 | 4.0 | 4.0 | 5.0 |
| EVAL_KNOWSUFFIC | 865.0 | 3.953757 | 0.961137 | 1.0 | 4.0 | 4.0 | 5.0 | 5.0 |
| TEMP_PLACEOFWORK | 865.0 | 36.537688 | 1.313442 | 33.7 | 35.4 | 37.0 | 37.5 | 39.0 |
| ACA | 865.0 | 1.026590 | 2.825641 | 0.0 | 0.0 | 0.0 | 1.0 | 63.0 |
In [19]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# Define the columns that need scaling
columns_to_scale = ['SCORE_RISKOFMACH', 'SCORE_ILLUM', 'TEMP_PLACEOFWORK', 'ACA']
# Apply Standardization (Mean = 0, Std Dev = 1)
scaler = StandardScaler()
database[columns_to_scale] = scaler.fit_transform(database[columns_to_scale])
# Check the new statistics
print(database.describe())
AGE_OPERATOR YEARS_EXP SENIORITY EMPLOYEE_CAT \
count 865.000000 865.000000 865.000000 865.000000
mean 35.958382 3.579191 7.972254 5.395376
std 9.693241 3.100333 6.562320 1.248509
min 19.000000 1.000000 1.000000 2.000000
25% 28.000000 1.000000 2.000000 5.000000
50% 34.000000 2.000000 7.000000 6.000000
75% 44.000000 6.000000 15.000000 6.000000
max 57.000000 10.000000 21.000000 7.000000
HOURS_OFTRAINING_SECURITY HOURS_OFTRAINING_POSITION GRADE_TEOREXAM \
count 865.0 865.000000 865.000000
mean 14.5 33.535260 97.115607
std 0.0 7.859382 5.020610
min 14.5 10.000000 80.000000
25% 14.5 32.000000 95.000000
50% 14.5 38.000000 100.000000
75% 14.5 38.000000 100.000000
max 14.5 38.000000 100.000000
GRADE_PRACTICALEXAM NUMBER_ILLS SCORE_RISKOFMACH SCORE_ILLUM \
count 865.000000 865.000000 8.650000e+02 8.650000e+02
mean 97.358382 1.658960 9.328440e-16 2.587526e-16
std 4.911120 1.154898 1.000579e+00 1.000579e+00
min 80.000000 0.000000 -1.164959e+00 -1.618777e+00
25% 95.000000 1.000000 -1.160852e+00 -1.076092e+00
50% 100.000000 2.000000 -1.916640e-01 4.112697e-01
75% 100.000000 3.000000 2.718098e-01 5.519660e-01
max 100.000000 4.000000 1.468628e+00 1.416243e+00
NOISE_ATPLACE NUMBER_EXTRAHOURS NUMBER_RESTHOURS SCORE_HIDRAT \
count 865.000000 865.000000 865.000000 865.000000
mean 84.492520 0.524855 0.141965 4.954913
std 1.325648 2.134856 1.004473 2.462506
min 82.000000 0.000000 0.000000 0.000000
25% 83.800000 0.000000 0.000000 3.000000
50% 84.000000 0.000000 0.000000 5.000000
75% 85.000000 0.000000 0.000000 6.000000
max 88.000000 15.300000 8.100000 15.000000
USE_PPE USE_ADEQTOOLS SUFFER?ANXIETY EXPOSED_QUIM SCORE_ILLUM.1 \
count 865.000000 865.000000 865.000000 865.000000 865.000000
mean 0.876301 0.708671 0.223121 2.395376 4.223121
std 0.329429 0.454638 0.416580 1.197403 0.745308
min 0.000000 0.000000 0.000000 1.000000 2.000000
25% 1.000000 0.000000 0.000000 1.000000 4.000000
50% 1.000000 1.000000 0.000000 2.000000 4.000000
75% 1.000000 1.000000 0.000000 3.000000 5.000000
max 1.000000 1.000000 1.000000 5.000000 5.000000
AVAILABLE_SPACE SCORE_FATIGUE EVAL_TIMEAVAIL EVAL_KNOWSUFFIC \
count 865.000000 865.000000 865.000000 865.000000
mean 4.083237 2.690173 3.937572 3.953757
std 0.868699 1.300359 0.823996 0.961137
min 1.000000 1.000000 1.000000 1.000000
25% 4.000000 2.000000 4.000000 4.000000
50% 4.000000 2.000000 4.000000 4.000000
75% 5.000000 4.000000 4.000000 5.000000
max 5.000000 5.000000 5.000000 5.000000
TEMP_PLACEOFWORK ACA
count 8.650000e+02 8.650000e+02
mean -1.314299e-15 4.107183e-17
std 1.000579e+00 1.000579e+00
min -2.161748e+00 -3.635223e-01
25% -8.666895e-01 -3.635223e-01
50% 3.521889e-01 -3.635223e-01
75% 7.330884e-01 -9.415556e-03
max 1.875787e+00 2.194520e+01
In [12]:
# Let's look at the statistical summary of the data
data.describe().T
Out[12]:
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| AGE_OPERATOR | 865.0 | 35.958382 | 9.693241 | 19.0 | 28.0 | 34.0 | 44.0 | 57.0 |
| YEARS_EXP | 865.0 | 3.579191 | 3.100333 | 1.0 | 1.0 | 2.0 | 6.0 | 10.0 |
| SENIORITY | 865.0 | 7.972254 | 6.562320 | 1.0 | 2.0 | 7.0 | 15.0 | 21.0 |
| EMPLOYEE_CAT | 865.0 | 5.395376 | 1.248509 | 2.0 | 5.0 | 6.0 | 6.0 | 7.0 |
| HOURS_OFTRAINING_SECURITY | 865.0 | 14.500000 | 0.000000 | 14.5 | 14.5 | 14.5 | 14.5 | 14.5 |
| HOURS_OFTRAINING_POSITION | 865.0 | 33.535260 | 7.859382 | 10.0 | 32.0 | 38.0 | 38.0 | 38.0 |
| GRADE_TEOREXAM | 865.0 | 97.115607 | 5.020610 | 80.0 | 95.0 | 100.0 | 100.0 | 100.0 |
| GRADE_PRACTICALEXAM | 865.0 | 97.358382 | 4.911120 | 80.0 | 95.0 | 100.0 | 100.0 | 100.0 |
| NUMBER_ILLS | 865.0 | 1.658960 | 1.154898 | 0.0 | 1.0 | 2.0 | 3.0 | 4.0 |
| SCORE_RISKOFMACH | 865.0 | 821.669503 | 170.550540 | 623.1 | 623.8 | 789.0 | 868.0 | 1072.0 |
| SCORE_ILLUM | 865.0 | 93.538289 | 49.781323 | 13.0 | 40.0 | 114.0 | 121.0 | 164.0 |
| NOISE_ATPLACE | 865.0 | 84.492520 | 1.325648 | 82.0 | 83.8 | 84.0 | 85.0 | 88.0 |
| NUMBER_EXTRAHOURS | 865.0 | 0.524855 | 2.134856 | 0.0 | 0.0 | 0.0 | 0.0 | 15.3 |
| NUMBER_RESTHOURS | 865.0 | 0.141965 | 1.004473 | 0.0 | 0.0 | 0.0 | 0.0 | 8.1 |
| SCORE_HIDRAT | 865.0 | 4.954913 | 2.462506 | 0.0 | 3.0 | 5.0 | 6.0 | 15.0 |
| USE_PPE | 865.0 | 0.876301 | 0.329429 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| USE_ADEQTOOLS | 865.0 | 0.708671 | 0.454638 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| SUFFER?ANXIETY | 865.0 | 0.223121 | 0.416580 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| EXPOSED_QUIM | 865.0 | 2.395376 | 1.197403 | 1.0 | 1.0 | 2.0 | 3.0 | 5.0 |
| AVAILABLE_SPACE | 865.0 | 4.083237 | 0.868699 | 1.0 | 4.0 | 4.0 | 5.0 | 5.0 |
| SCORE_FATIGUE | 865.0 | 2.690173 | 1.300359 | 1.0 | 2.0 | 2.0 | 4.0 | 5.0 |
| EVAL_TIMEAVAIL | 865.0 | 3.937572 | 0.823996 | 1.0 | 4.0 | 4.0 | 4.0 | 5.0 |
| EVAL_KNOWSUFFIC | 865.0 | 3.953757 | 0.961137 | 1.0 | 4.0 | 4.0 | 5.0 | 5.0 |
| TEMP_PLACEOFWORK | 865.0 | 36.537688 | 1.313442 | 33.7 | 35.4 | 37.0 | 37.5 | 39.0 |
| ACA | 865.0 | 1.026590 | 2.825641 | 0.0 | 0.0 | 0.0 | 1.0 | 63.0 |
In [16]:
# Let´s define a functions to visualize the cahateristicas of our variables
def histogram_boxplot(data, feature, figsize=(12, 7), kde=True, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (12,7))
kde: whether to the show density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a star will indicate the mean value of the column
sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
) if bins else sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2
) # For histogram
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
In [18]:
# Iterate over each column in the dataframe and apply the histogram_boxplot function
import matplotlib.pyplot as plt
import seaborn as sns
for column in data.columns:
histogram_boxplot(data, column)
plt.show()
In [ ]:
Model- Boxplot and Histogram
The analysis of the data reveals various trends across multiple variables. The age distribution of operators shows a concentration between 20-35 years, with a smaller number in the 42-55 age range. When it comes to years of experience, most employees have between 1-3 years of experience, while a few have 6-10 years. The seniority levels are also skewed, with the majority of employees having 0-7.5 years of seniority, and fewer employees falling in the 15-20 year range.
In terms of employee categories, the majority of employees fall under category 6. The number of hours of security training shows little variation, mostly ranging between 14 and 15 hours, while the hours of position training are more varied. Most employees received between 30 and 40 hours of position training, though some had only 10 hours, and a slightly higher amount received between 20-23 hours.
Regarding grading metrics, the distribution of grades to reexamine shows that grade 100 had around 600 occurrences, grade 95 had about 50 occurrences, and grade 90 had about 200 occurrences. For the practical exam, grade 90 had approximately 200 occurrences, grade 95 had 50 occurrences, and grade 100 had about 600 occurrences.
The number of illnesses showed a significant spread: between 0 and 0.5 illnesses occurred 150 times, between 0.75 and 1.1 illnesses occurred 250 times, 2.0 illnesses occurred 250 times, 3.0 illnesses occurred 175 times, and 3.6 illnesses occurred 50 times. Lastly, the score for the risk of machines showed that scores between 200-400 occurred 275 times, scores of 800 occurred 150 times, and scores near 900 appeared 200 times.
This overall pattern shows that most variables exhibit clear trends with a few outliers. For example, the number of illnesses is widely spread, while the training hours for positions and grade distributions suggest more concentrated ranges, indicating common levels of performance and training among employees.
In [19]:
plt.figure(figsize=(15, 7))
sns.heatmap(
data.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.show()
In [ ]:
Model 7 heat map
Heatmap Insights:
Looking at the heatmap, I noticed that "hidra" and "eval_knowsuffice" seem to be more correlated than before, which might mean they affect each other more than I originally thought. Also, there’s a visible relationship between "hours of training" and "age of operator", which is kind of interesting because it suggests that older operators might have different training patterns compared to younger ones.
These observations could help improve how we optimize the models—like tweaking hyperparameters based on which variables are strongly correlated. It might lead to better results, especially in terms of RMSE, R², MAE for regression models, and accuracy, TP, TN, FP, FN for classification models.
In [ ]:
Exploratory Analysis :This code uses the StandardScaler from sklearn.preprocessing to standardize specific numerical columns in a dataset named database. The columns selected for scaling are 'SCORE_RISKOFMACH', 'SCORE_ILLUM', 'TEMP_PLACEOFWORK', and 'ACA'. Standardization transforms these features to have a mean of 0 and a standard deviation of 1, which helps improve the performance of machine learning models by ensuring that all features contribute equally. The scaler.fit_transform() method is applied to the selected columns, replacing their original values with the standardized ones. Finally, the print(database.describe()) command is used to display summary statistics of the updated dataset, allowing verification of the applied transformation.
In [24]:
sns.pairplot(data=data)
Out[24]:
<seaborn.axisgrid.PairGrid at 0x1b5ac53b3e0>
In [ ]:
Model- SNS pairplot showed some correlation between value but I found it difficult to analyze the information;
Insights from Pairplot:
The pairplot has been really helpful in visualizing the relationships between the different variables. It’s clear that some pairs of features, like "feature_x" and "feature_y", are strongly correlated, which could be useful for our analysis. Other pairs show more scattered relationships, which might suggest weaker associations.
For categorical outcomes, the pairplot can show us how well the different categories separate. If the points for each class are spread out in a way that makes it easy to distinguish them, then the model might be working well. If not, we may need to rethink our features or model choices.
For the numerical outcomes, the pairplot helps identify trends or groupings that could lead to better predictive performance. It gives us a visual representation of the data that might point us toward adjusting the model to improve predictions.
Based on the patterns in the pairplot and the hyperparameter tuning, we can see that Model 1 might benefit from some additional tuning or even feature engineering to better capture the relationships between the variables. This analysis of the pairplot and the metrics helps us better understand which features are most important, which can guide further improvements to our models.
In [25]:
# Create the set of independent variables (X) and the dependent variable (y)
X = data.drop(["SCORE_FATIGUE"], axis=1) # Drop the target variable from the features
y = data["SCORE_FATIGUE"] # Set the dependent variable
In [26]:
X
Out[26]:
| AGE_OPERATOR | YEARS_EXP | SENIORITY | EMPLOYEE_CAT | HOURS_OFTRAINING_SECURITY | HOURS_OFTRAINING_POSITION | GRADE_TEOREXAM | GRADE_PRACTICALEXAM | NUMBER_ILLS | SCORE_RISKOFMACH | ... | SCORE_HIDRAT | USE_PPE | USE_ADEQTOOLS | SUFFER?ANXIETY | EXPOSED_QUIM | AVAILABLE_SPACE | EVAL_TIMEAVAIL | EVAL_KNOWSUFFIC | TEMP_PLACEOFWORK | ACA | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 36 | 6 | 6 | 6 | 14.5 | 38 | 100 | 100 | 3 | 789.0 | ... | 3 | 1 | 1 | 1 | 1 | 4 | 4 | 4 | 36.7 | 0 |
| 1 | 19 | 1 | 1 | 6 | 14.5 | 22 | 95 | 95 | 1 | 789.0 | ... | 7 | 1 | 1 | 0 | 1 | 4 | 4 | 4 | 36.3 | 0 |
| 2 | 39 | 10 | 21 | 6 | 14.5 | 38 | 100 | 100 | 2 | 868.0 | ... | 6 | 1 | 1 | 0 | 4 | 3 | 4 | 5 | 36.3 | 0 |
| 3 | 22 | 1 | 1 | 7 | 14.5 | 10 | 90 | 90 | 2 | 868.0 | ... | 8 | 1 | 1 | 0 | 3 | 5 | 5 | 5 | 36.3 | 4 |
| 4 | 26 | 1 | 4 | 6 | 14.5 | 38 | 100 | 100 | 1 | 1072.0 | ... | 5 | 1 | 1 | 0 | 2 | 4 | 4 | 4 | 36.3 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 860 | 26 | 1 | 4 | 6 | 14.5 | 38 | 100 | 100 | 1 | 1072.0 | ... | 3 | 1 | 1 | 0 | 2 | 4 | 4 | 4 | 35.5 | 0 |
| 861 | 28 | 1 | 1 | 2 | 14.5 | 20 | 100 | 100 | 0 | 623.8 | ... | 5 | 1 | 1 | 0 | 3 | 4 | 4 | 4 | 35.3 | 0 |
| 862 | 55 | 10 | 15 | 6 | 14.5 | 38 | 90 | 90 | 1 | 623.9 | ... | 4 | 1 | 1 | 0 | 2 | 5 | 2 | 2 | 35.3 | 0 |
| 863 | 24 | 1 | 1 | 4 | 14.5 | 22 | 100 | 100 | 1 | 1072.0 | ... | 3 | 1 | 1 | 0 | 1 | 5 | 5 | 4 | 35.4 | 0 |
| 864 | 36 | 1 | 6 | 6 | 14.5 | 38 | 100 | 100 | 2 | 1072.0 | ... | 6 | 1 | 1 | 0 | 3 | 5 | 5 | 5 | 35.4 | 0 |
865 rows × 24 columns
In [28]:
y
Out[28]:
0 3
1 2
2 4
3 5
4 2
..
860 2
861 2
862 2
863 1
864 2
Name: SCORE_FATIGUE, Length: 865, dtype: int64
In [29]:
type(y)
Out[29]:
pandas.core.series.Series
In [32]:
import statsmodels.api as sm
In [33]:
#Running the model
# Add a constant term to the predictor variable set
# statmodelS does not do by default
import statsmodels.api as sm
# Add a constant term to the predictor variable set
X_with_constant = sm.add_constant(X)
# Running the OLS model
model1 = sm.OLS(y, X_with_constant)
results1 = model1.fit()
# Summarize the results
print(results1.summary())
OLS Regression Results
==============================================================================
Dep. Variable: SCORE_FATIGUE R-squared: 0.423
Model: OLS Adj. R-squared: 0.407
Method: Least Squares F-statistic: 26.81
Date: Sun, 09 Mar 2025 Prob (F-statistic): 1.56e-84
Time: 16:01:07 Log-Likelihood: -1216.2
No. Observations: 865 AIC: 2480.
Df Residuals: 841 BIC: 2595.
Df Model: 23
Covariance Type: nonrobust
=============================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------------
AGE_OPERATOR -0.0131 0.005 -2.473 0.014 -0.023 -0.003
YEARS_EXP -0.0150 0.015 -0.986 0.324 -0.045 0.015
SENIORITY -0.0016 0.009 -0.171 0.865 -0.020 0.017
EMPLOYEE_CAT 0.2008 0.047 4.242 0.000 0.108 0.294
HOURS_OFTRAINING_SECURITY -0.0177 0.184 -0.096 0.924 -0.379 0.344
HOURS_OFTRAINING_POSITION -0.0147 0.009 -1.641 0.101 -0.032 0.003
GRADE_TEOREXAM 0.0573 0.026 2.241 0.025 0.007 0.108
GRADE_PRACTICALEXAM -0.0688 0.027 -2.514 0.012 -0.122 -0.015
NUMBER_ILLS 0.0228 0.038 0.600 0.549 -0.052 0.097
SCORE_RISKOFMACH -0.0011 0.000 -4.048 0.000 -0.002 -0.001
SCORE_ILLUM 0.0003 0.001 0.396 0.692 -0.001 0.002
NOISE_ATPLACE -0.0089 0.029 -0.304 0.761 -0.066 0.048
NUMBER_EXTRAHOURS 0.0253 0.017 1.501 0.134 -0.008 0.058
NUMBER_RESTHOURS 0.0454 0.035 1.282 0.200 -0.024 0.115
SCORE_HIDRAT 0.1331 0.016 8.336 0.000 0.102 0.164
USE_PPE 0.0935 0.130 0.718 0.473 -0.162 0.349
USE_ADEQTOOLS -0.4447 0.114 -3.901 0.000 -0.669 -0.221
SUFFER?ANXIETY 0.7777 0.105 7.384 0.000 0.571 0.984
EXPOSED_QUIM 0.3987 0.038 10.514 0.000 0.324 0.473
AVAILABLE_SPACE 0.0936 0.056 1.675 0.094 -0.016 0.203
EVAL_TIMEAVAIL -0.0229 0.054 -0.428 0.669 -0.128 0.082
EVAL_KNOWSUFFIC -0.0452 0.045 -0.999 0.318 -0.134 0.044
TEMP_PLACEOFWORK 0.1077 0.029 3.759 0.000 0.051 0.164
ACA 0.0112 0.013 0.887 0.375 -0.014 0.036
==============================================================================
Omnibus: 11.952 Durbin-Watson: 1.802
Prob(Omnibus): 0.003 Jarque-Bera (JB): 12.300
Skew: 0.287 Prob(JB): 0.00213
Kurtosis: 2.890 Cond. No. 4.78e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.78e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
In [34]:
# Use the summary() method on the results1 object to see a more
# detailed report
print(results1.summary())
OLS Regression Results
==============================================================================
Dep. Variable: SCORE_FATIGUE R-squared: 0.423
Model: OLS Adj. R-squared: 0.407
Method: Least Squares F-statistic: 26.81
Date: Sun, 09 Mar 2025 Prob (F-statistic): 1.56e-84
Time: 16:02:21 Log-Likelihood: -1216.2
No. Observations: 865 AIC: 2480.
Df Residuals: 841 BIC: 2595.
Df Model: 23
Covariance Type: nonrobust
=============================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------------
AGE_OPERATOR -0.0131 0.005 -2.473 0.014 -0.023 -0.003
YEARS_EXP -0.0150 0.015 -0.986 0.324 -0.045 0.015
SENIORITY -0.0016 0.009 -0.171 0.865 -0.020 0.017
EMPLOYEE_CAT 0.2008 0.047 4.242 0.000 0.108 0.294
HOURS_OFTRAINING_SECURITY -0.0177 0.184 -0.096 0.924 -0.379 0.344
HOURS_OFTRAINING_POSITION -0.0147 0.009 -1.641 0.101 -0.032 0.003
GRADE_TEOREXAM 0.0573 0.026 2.241 0.025 0.007 0.108
GRADE_PRACTICALEXAM -0.0688 0.027 -2.514 0.012 -0.122 -0.015
NUMBER_ILLS 0.0228 0.038 0.600 0.549 -0.052 0.097
SCORE_RISKOFMACH -0.0011 0.000 -4.048 0.000 -0.002 -0.001
SCORE_ILLUM 0.0003 0.001 0.396 0.692 -0.001 0.002
NOISE_ATPLACE -0.0089 0.029 -0.304 0.761 -0.066 0.048
NUMBER_EXTRAHOURS 0.0253 0.017 1.501 0.134 -0.008 0.058
NUMBER_RESTHOURS 0.0454 0.035 1.282 0.200 -0.024 0.115
SCORE_HIDRAT 0.1331 0.016 8.336 0.000 0.102 0.164
USE_PPE 0.0935 0.130 0.718 0.473 -0.162 0.349
USE_ADEQTOOLS -0.4447 0.114 -3.901 0.000 -0.669 -0.221
SUFFER?ANXIETY 0.7777 0.105 7.384 0.000 0.571 0.984
EXPOSED_QUIM 0.3987 0.038 10.514 0.000 0.324 0.473
AVAILABLE_SPACE 0.0936 0.056 1.675 0.094 -0.016 0.203
EVAL_TIMEAVAIL -0.0229 0.054 -0.428 0.669 -0.128 0.082
EVAL_KNOWSUFFIC -0.0452 0.045 -0.999 0.318 -0.134 0.044
TEMP_PLACEOFWORK 0.1077 0.029 3.759 0.000 0.051 0.164
ACA 0.0112 0.013 0.887 0.375 -0.014 0.036
==============================================================================
Omnibus: 11.952 Durbin-Watson: 1.802
Prob(Omnibus): 0.003 Jarque-Bera (JB): 12.300
Skew: 0.287 Prob(JB): 0.00213
Kurtosis: 2.890 Cond. No. 4.78e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.78e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
In [ ]:
OLS Regression Results
The model summary for the dependent variable SCORE_FATIGUE reveals an R-squared of 0.423, indicating that the model explains approximately 42.3% of the variance in fatigue scores, which is considered moderate. The adjusted R-squared value of 0.407 takes into account the number of predictors and adjusts for the inclusion of irrelevant variables. The F-statistic of 26.81, with a p-value of 1.56e-84, indicates that the model as a whole is statistically significant. The number of observations in the model is 865, and the AIC and BIC values are 2480 and 2595, respectively.
Several key variables were found to significantly affect the fatigue score. These include EMPLOYEE_CAT, which has a positive effect on fatigue, GRADE_TEOREXAM with a positive effect, GRADE_PRACTICALEXAM with a negative effect, SCORE_RISKOFMACH with a negative effect, SCORE_HIDRAT with a positive effect, USE_ADEQTOOLS with a negative effect, SUFFER?ANXIETY with a positive effect, EXPOSED_QUIM with a positive effect, and TEMP_PLACEOFWORK with a positive effect. On the other hand, variables such as YEARS_EXP, SENIORITY, NUMBER_ILLS, NOISE_ATPLACE, AVAILABLE_SPACE, EVAL_TIMEAVAIL, EVAL_KNOWSUFFIC, and ACA were not statistically significant, as their p-values were greater than 0.05, suggesting they do not have a significant impact on the dependent variable in this model.
From a diagnostic perspective, the Omnibus test p-value of 0.003 suggests that the residuals are not perfectly normally distributed, and the Durbin-Watson statistic of 1.802 indicates a relatively low likelihood of autocorrelation in the residuals. The Jarque-Bera test, with a p-value of 0.00213, further confirms that the residuals deviate from normality.
Overall, this regression model serves as a good starting point for understanding factors that contribute to fatigue. It highlights several significant predictors, particularly related to employee characteristics and environmental factors. However, further refinement of the model may be necessary to improve the fit and address the non-significant variables.
In [ ]:
The analysis reveals that not all variables equally explain the dependent variable, SCORE_FATIGUE. The key contributors include:
AGE_OPERATOR
EMPLOYEE_CAT
GRADE_TEOREXAM
GRADE_PRACTICALEXAM
SCORE_RISKOFMACH
SCORE_HIDRAT
USE_ADEQTOOLS
SUFFER?ANXIETY
EXPOSED_QUIM
TEMP_PLACEOFWORK
Together, these variables account for 42.3% of the variance in SCORE_FATIGUE (R-squared = 0.423).
In [35]:
# Define the backward selection function
def backward_selection(data, target, significance_level = 0.05):
initial_features = data.columns.tolist()
best_features = initial_features[:]
while len(best_features) > 0:
features_with_constant = sm.add_constant(data[best_features])
p_values = sm.OLS(target, features_with_constant).fit().pvalues[1:]
max_p_value = p_values.max()
if max_p_value >= significance_level:
excluded_feature = p_values.idxmax()
best_features.remove(excluded_feature)
else:
break
return best_features
In [36]:
# Get the selected features using backward selection
selected_features = backward_selection(X, y)
# Fit the model using only the selected features
model2 = sm.OLS(y, sm.add_constant(data[selected_features])).fit()
# Output the summary of the model
print(model2.summary())
OLS Regression Results
==============================================================================
Dep. Variable: SCORE_FATIGUE R-squared: 0.412
Model: OLS Adj. R-squared: 0.405
Method: Least Squares F-statistic: 59.92
Date: Sun, 09 Mar 2025 Prob (F-statistic): 1.09e-91
Time: 16:09:58 Log-Likelihood: -1224.1
No. Observations: 865 AIC: 2470.
Df Residuals: 854 BIC: 2523.
Df Model: 10
Covariance Type: nonrobust
=======================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------
const 0.0906 1.265 0.072 0.943 -2.392 2.573
AGE_OPERATOR -0.0153 0.004 -3.573 0.000 -0.024 -0.007
EMPLOYEE_CAT 0.1218 0.031 3.975 0.000 0.062 0.182
GRADE_TEOREXAM 0.0511 0.023 2.218 0.027 0.006 0.096
GRADE_PRACTICALEXAM -0.0711 0.024 -2.954 0.003 -0.118 -0.024
SCORE_RISKOFMACH -0.0010 0.000 -4.476 0.000 -0.002 -0.001
SCORE_HIDRAT 0.1378 0.016 8.874 0.000 0.107 0.168
USE_ADEQTOOLS -0.4096 0.101 -4.036 0.000 -0.609 -0.210
SUFFER?ANXIETY 0.7432 0.102 7.287 0.000 0.543 0.943
EXPOSED_QUIM 0.4003 0.033 12.145 0.000 0.336 0.465
TEMP_PLACEOFWORK 0.1038 0.027 3.822 0.000 0.050 0.157
==============================================================================
Omnibus: 13.969 Durbin-Watson: 1.788
Prob(Omnibus): 0.001 Jarque-Bera (JB): 13.924
Skew: 0.286 Prob(JB): 0.000947
Kurtosis: 2.757 Cond. No. 3.16e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.16e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
In [ ]:
Model 2, which includes only the significant variables, is the optimal model, achieving an adjusted R-squared of 0.405. This indicates a moderate level of explanatory power of the predictors for the dependent variable.
In [37]:
# Getting leverage statistics
influence = model2.get_influence()
leverage = influence.hat_matrix_diag
# Creating a DataFrame that includes the leverage values alongside your variables:
import pandas as pd
leverage_df = pd.DataFrame({
'AGE_OPERATOR': X['AGE_OPERATOR'],
'EMPLOYEE_CAT': X['EMPLOYEE_CAT'],
'GRADE_TEOREXAM': X['GRADE_TEOREXAM'],
'GRADE_PRACTICALEXAM': X['GRADE_PRACTICALEXAM'],
'SCORE_RISKOFMACH': X['SCORE_RISKOFMACH'],
'SCORE_HIDRAT': X['SCORE_HIDRAT'],
'USE_ADEQTOOLS': X['USE_ADEQTOOLS'],
'SUFFER?ANXIETY': X['SUFFER?ANXIETY'],
'EXPOSED_QUIM': X['EXPOSED_QUIM'],
'TEMP_PLACEOFWORK': X['TEMP_PLACEOFWORK'],
'Leverage': leverage
})
print(leverage_df)
AGE_OPERATOR EMPLOYEE_CAT GRADE_TEOREXAM GRADE_PRACTICALEXAM \
0 36 6 100 100
1 19 6 95 95
2 39 6 100 100
3 22 7 90 90
4 26 6 100 100
.. ... ... ... ...
860 26 6 100 100
861 28 2 100 100
862 55 6 90 90
863 24 4 100 100
864 36 6 100 100
SCORE_RISKOFMACH SCORE_HIDRAT USE_ADEQTOOLS SUFFER?ANXIETY \
0 789.0 3 1 1
1 789.0 7 1 0
2 868.0 6 1 0
3 868.0 8 1 0
4 1072.0 5 1 0
.. ... ... ... ...
860 1072.0 3 1 0
861 623.8 5 1 0
862 623.9 4 1 0
863 1072.0 3 1 0
864 1072.0 6 1 0
EXPOSED_QUIM TEMP_PLACEOFWORK Leverage
0 1 36.7 0.015461
1 1 36.3 0.013350
2 4 36.3 0.007328
3 3 36.3 0.017156
4 2 36.3 0.005629
.. ... ... ...
860 2 35.5 0.007690
861 3 35.3 0.016255
862 2 35.3 0.010043
863 1 35.4 0.009877
864 3 35.4 0.008120
[865 rows x 11 columns]
In [38]:
# Filter the DataFrame for rows where leverage is greater than 0.019
high_leverage_rows = leverage_df[leverage_df['Leverage'] > 0.019]
print("Rows with leverage greater than 0.019:")
print(high_leverage_rows)
Rows with leverage greater than 0.019:
AGE_OPERATOR EMPLOYEE_CAT GRADE_TEOREXAM GRADE_PRACTICALEXAM \
19 22 4 100 100
30 34 6 90 100
33 44 5 80 80
41 28 2 100 100
51 35 7 90 90
.. ... ... ... ...
784 34 6 90 100
787 44 4 80 80
819 44 4 80 80
830 29 2 90 90
859 44 4 80 80
SCORE_RISKOFMACH SCORE_HIDRAT USE_ADEQTOOLS SUFFER?ANXIETY \
19 623.1 13 1 0
30 623.9 3 0 0
33 1072.0 5 1 1
41 623.8 5 1 0
51 868.0 9 1 1
.. ... ... ... ...
784 623.9 5 0 1
787 1072.0 5 1 0
819 1072.0 2 0 1
830 1072.0 6 0 0
859 1072.0 5 1 0
EXPOSED_QUIM TEMP_PLACEOFWORK Leverage
19 1 36.4 0.026854
30 1 36.1 0.057912
33 2 36.7 0.027726
41 5 38.3 0.023349
51 3 38.0 0.019426
.. ... ... ...
784 4 35.0 0.052397
787 2 35.4 0.020858
819 4 35.0 0.027602
830 4 35.4 0.024765
859 1 35.6 0.022010
[107 rows x 11 columns]
In [39]:
# Filtering the DataFrame for rows where leverage is greater than 0.019
high_leverage_rows = leverage_df[leverage_df['Leverage'] > 0.019]
# Count the number of rows with leverage greater than 0.019
num_high_leverage_rows = high_leverage_rows.shape[0]
print(f"Number of rows with leverage greater than 0.019: {num_high_leverage_rows}")
Number of rows with leverage greater than 0.019: 107
In [40]:
import matplotlib.pyplot as plt
# Assuming high_leverage_rows is the DataFrame with rows where leverage is greater than 0.019
# Plotting
plt.figure(figsize=(10, 6)) # Adjust the figure size as needed
plt.scatter(high_leverage_rows.index, high_leverage_rows['Leverage'], color='blue') # Scatter plot of leverage values
plt.axhline(y=0.019, color='r', linestyle='--') # Horizontal line at leverage value 0.019
# Labeling the axes
plt.xlabel('Index')
plt.ylabel('Leverage')
plt.title('Leverage Values Greater Than 0.019')
# Show the plot
plt.show()
In [ ]:
Model 1- Leverage Values Greater Than 0.019
In the scatter plot of leverage values, I noticed that around 22 data points were above the 0.05 leverage threshold, suggesting these points might have a bigger impact on the model's predictions. Most of the other data points were below the 0.03 to 0.02 range, with just a few scattered between these values. This indicates that most of the data points have relatively low leverage, meaning they have less influence on the model, but the points above the 0.05 threshold could be worth investigating since they might disproportionately affect the results.
Key Takeaway: For the scatter plot of leverage, the main goal is to identify any outliers with high leverage. After that, if the model is producing numerical predictions (even if they're not directly shown in the plot), I would need to rely on metrics like R² (Adjusted) and possibly RMSE/MAE to evaluate the model’s performance.
In [41]:
# Calculate fitted values and residuals from ou best model
fitted_values = model2.fittedvalues
residuals = model2.resid
# Plotting the scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(fitted_values, residuals, color='blue')
# Adding a horizontal line at zero for reference
plt.axhline(y=0, color='red', linestyle='--')
# Labeling the axes
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs Fitted Values')
# Show the plot
plt.show()
In [ ]:
Exploratory Analysis: This code checks how well a model fits the data by analyzing residuals (errors).
First, it calculates fitted values (model2.fittedvalues), which are the predicted values from the model, and residuals (model2.resid), which are the differences between actual and predicted values.
Then, it creates a scatter plot where fitted values are on the x-axis and residuals are on the y-axis. Ideally, residuals should be randomly scattered around zero, meaning the model’s errors have no pattern. A red dashed line at zero helps visualize this.
This plot is useful for spotting issues like non-random patterns, which could indicate that the model isn’t capturing all trends in the data. Finally, plt.show() displays the plot.
In [42]:
# Extracting the residuals from the model
residuals = model2.resid
# Generating a Q-Q plot
fig = sm.qqplot(residuals, line='s') # 's' indicates a standardized line
# setting title and labels for the plot
plt.title('Q-Q Plot of Residuals')
plt.xlabel('Theoretical Quantiles')
plt.ylabel('Sample Quantiles')
# Show the plot
plt.show()
In [ ]:
Model 3 - Q-Q Plot of Residuals
In the Q-Q plot of residuals, I observed that as the model's residuals approach the (0, 0) point, where the theoretical and sample quantities align, the data points generally cluster around the red line, indicating a positive trend. This suggests that for the majority of the data, the residuals are fairly consistent with the expected distribution. However, at the extreme ends of the plot, around the points (-3, -3) and (3, 3), the dots deviate from the red line, showing less consistency in the trend. This deviation suggests that there are some residuals in the tails of the distribution that don't align well with the expected normal distribution, possibly indicating outliers or non-normality in the data.
Key Takeaway: The Q-Q plot is useful for checking the normality of residuals. The closer the data points are to the red line, the better the residuals align with a normal distribution. The deviations at the extreme ends suggest that there may be some outliers or unusual patterns in the residuals, which could affect the model's performance. If the model is producing numerical predictions, I would still rely on metrics like R² (Adjusted), RMSE, and MAE to further assess the model’s accuracy and fit.
In [43]:
# Calculating fitted values and residuals
fitted_values = model2.fittedvalues
residuals = model2.resid
# Creating the scatter plot for homoscedasticity
plt.figure(figsize=(10, 6))
plt.scatter(fitted_values, residuals, alpha=0.5)
# Adding a horizontal line at zero
plt.axhline(y=0, color='red', linestyle='--')
# Labelling the axes
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs Fitted Values')
# Show the plot
plt.show()
In [ ]:
he variance of the residuals is not constant across all levels of the fitted values, and this should be addressed to improve the validity of your model.
In [ ]:
Model 4 - Residuals vs. Fitted Values
In the residuals vs. fitted values plot for Model 4, I observed a pattern similar to the one seen in Model 2. The residuals are scattered around the horizontal line at zero, but there is no clear trend indicating problems with heteroscedasticity or non-linearity. Most of the residuals appear to be evenly distributed across the fitted values, suggesting that the model's assumptions are likely valid. However, if there are any outliers or unusual patterns, they would need further investigation. The consistency in the pattern across both models suggests the model's ability to generate reasonable residuals without any major issues in the relationship between the fitted values and residuals.
Key Takeaway: The residuals vs. fitted values plot helps to verify that the residuals are randomly distributed, which would indicate that the model is appropriately capturing the underlying relationship. The similar pattern between Model 4 and Model 2 suggests that the residuals in both models are behaving in a comparable way, reinforcing the consistency of the models. To evaluate the model further, I would still rely on metrics like R² (Adjusted), RMSE, and MAE to assess its accuracy and fit.
In [44]:
# Updating the scikit version (in neccesary)
!pip install -U scikit-learn
Requirement already satisfied: scikit-learn in c:\users\wsher\anaconda3\lib\site-packages (1.5.1)
Collecting scikit-learn
Downloading scikit_learn-1.6.1-cp312-cp312-win_amd64.whl.metadata (15 kB)
Requirement already satisfied: numpy>=1.19.5 in c:\users\wsher\anaconda3\lib\site-packages (from scikit-learn) (1.26.4)
Requirement already satisfied: scipy>=1.6.0 in c:\users\wsher\anaconda3\lib\site-packages (from scikit-learn) (1.11.4)
Requirement already satisfied: joblib>=1.2.0 in c:\users\wsher\anaconda3\lib\site-packages (from scikit-learn) (1.4.2)
Requirement already satisfied: threadpoolctl>=3.1.0 in c:\users\wsher\anaconda3\lib\site-packages (from scikit-learn) (3.5.0)
Downloading scikit_learn-1.6.1-cp312-cp312-win_amd64.whl (11.1 MB)
---------------------------------------- 0.0/11.1 MB ? eta -:--:--
------- -------------------------------- 2.1/11.1 MB 11.8 MB/s eta 0:00:01
---------------- ----------------------- 4.7/11.1 MB 11.4 MB/s eta 0:00:01
------------------------ --------------- 6.8/11.1 MB 11.6 MB/s eta 0:00:01
---------------------------------- ----- 9.7/11.1 MB 11.6 MB/s eta 0:00:01
--------------------------------------- 11.0/11.1 MB 11.7 MB/s eta 0:00:01
---------------------------------------- 11.1/11.1 MB 9.9 MB/s eta 0:00:00
Installing collected packages: scikit-learn
Attempting uninstall: scikit-learn
Found existing installation: scikit-learn 1.5.1
Uninstalling scikit-learn-1.5.1:
Successfully uninstalled scikit-learn-1.5.1
Successfully installed scikit-learn-1.6.1
WARNING: Failed to remove contents in a temporary directory 'C:\Users\wsher\anaconda3\Lib\site-packages\~klearn'. You can safely remove it manually.
In [46]:
# Libraries to help with reading and manipulating data
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
# Command to tell Python to actually display the graphs
%matplotlib inline
# to restrict the float value to 3 decimal places
pd.set_option('display.float_format', lambda x: '%.3f' % x)
from sklearn.linear_model import LinearRegression
import seaborn as sns
import sklearn.metrics as metrics
from sklearn.model_selection import train_test_split
# To build model for prediction
from sklearn.linear_model import LogisticRegression
# To get diferent metric scores
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
precision_recall_curve,
confusion_matrix,
roc_curve,)
In [47]:
# Load the dataset by specifying the sheet name in the Excel file
# Make sure the file is located at the correct path on your local machine
data = pd.read_excel('C:/Users/wsher/Downloads/PLANT_SECURITY_SV.xlsx', sheet_name='DB')
In [48]:
data.head()
Out[48]:
| ID_LINE | AGE_OPERATOR | YEARS_EXP | SENIORITY | EMPLOYEE_CAT | HOURS_OFTRAINING_SECURITY | HOURS_OFTRAINING_POSITION | GRADE_TEOREXAM | GRADE_PRACTICALEXAM | NUMBER_ILLS | ... | USE_ADEQTOOLS | SUFFER?ANXIETY | EXPOSED_QUIM | SCORE_ILLUM.1 | AVAILABLE_SPACE | SCORE_FATIGUE | EVAL_TIMEAVAIL | EVAL_KNOWSUFFIC | TEMP_PLACEOFWORK | ACA | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 189399M851 | 36 | 6 | 6 | 6 | 14.500 | 38 | 100 | 100 | 3 | ... | 1 | 1 | 1 | 4 | 4 | 3 | 4 | 4 | 36.700 | 0 |
| 1 | 2133265M301 | 19 | 1 | 1 | 6 | 14.500 | 22 | 95 | 95 | 1 | ... | 1 | 0 | 1 | 4 | 4 | 2 | 4 | 4 | 36.300 | 0 |
| 2 | 32695VZF81 | 39 | 10 | 21 | 6 | 14.500 | 38 | 100 | 100 | 2 | ... | 1 | 0 | 4 | 4 | 3 | 4 | 4 | 5 | 36.300 | 0 |
| 3 | 4147823VZ81 | 22 | 1 | 1 | 7 | 14.500 | 10 | 90 | 90 | 2 | ... | 1 | 0 | 3 | 5 | 5 | 5 | 5 | 5 | 36.300 | 4 |
| 4 | 5106984MZV7/1 1 | 26 | 1 | 4 | 6 | 14.500 | 38 | 100 | 100 | 1 | ... | 1 | 0 | 2 | 4 | 4 | 2 | 4 | 4 | 36.300 | 0 |
5 rows × 27 columns
In [49]:
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 865 entries, 0 to 864 Data columns (total 27 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID_LINE 865 non-null object 1 AGE_OPERATOR 865 non-null int64 2 YEARS_EXP 865 non-null int64 3 SENIORITY 865 non-null int64 4 EMPLOYEE_CAT 865 non-null int64 5 HOURS_OFTRAINING_SECURITY 865 non-null float64 6 HOURS_OFTRAINING_POSITION 865 non-null int64 7 GRADE_TEOREXAM 865 non-null int64 8 GRADE_PRACTICALEXAM 865 non-null int64 9 NUMBER_ILLS 865 non-null int64 10 SCORE_RISKOFMACH 865 non-null float64 11 SCORE_ILLUM 865 non-null float64 12 NOISE_ATPLACE 865 non-null float64 13 NUMBER_EXTRAHOURS 865 non-null float64 14 NUMBER_RESTHOURS 865 non-null float64 15 SCORE_HIDRAT 865 non-null int64 16 USE_PPE 865 non-null int64 17 USE_ADEQTOOLS 865 non-null int64 18 SUFFER?ANXIETY 865 non-null int64 19 EXPOSED_QUIM 865 non-null int64 20 SCORE_ILLUM.1 865 non-null int64 21 AVAILABLE_SPACE 865 non-null int64 22 SCORE_FATIGUE 865 non-null int64 23 EVAL_TIMEAVAIL 865 non-null int64 24 EVAL_KNOWSUFFIC 865 non-null int64 25 TEMP_PLACEOFWORK 865 non-null float64 26 ACA 865 non-null int64 dtypes: float64(7), int64(19), object(1) memory usage: 182.6+ KB
In [50]:
# Converting columns to categorical type
data['EMPLOYEE_CAT'] = data['EMPLOYEE_CAT'].astype('category')
data['USE_PPE'] = data['USE_PPE'].astype('category')
data['USE_ADEQTOOLS'] = data['USE_ADEQTOOLS'].astype('category')
data['SUFFER?ANXIETY'] = data['SUFFER?ANXIETY'].astype('category')
data['EXPOSED_QUIM'] = data['EXPOSED_QUIM'].astype('category')
data['SCORE_ILLUM.1'] = data['SCORE_ILLUM.1'].astype('category')
data['AVAILABLE_SPACE'] = data['AVAILABLE_SPACE'].astype('category')
data['SCORE_FATIGUE'] = data['SCORE_FATIGUE'].astype('category')
data['EVAL_TIMEAVAIL'] = data['EVAL_TIMEAVAIL'].astype('category')
data['EVAL_KNOWSUFFIC'] = data['EVAL_KNOWSUFFIC'].astype('category')
In [51]:
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 865 entries, 0 to 864 Data columns (total 27 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID_LINE 865 non-null object 1 AGE_OPERATOR 865 non-null int64 2 YEARS_EXP 865 non-null int64 3 SENIORITY 865 non-null int64 4 EMPLOYEE_CAT 865 non-null category 5 HOURS_OFTRAINING_SECURITY 865 non-null float64 6 HOURS_OFTRAINING_POSITION 865 non-null int64 7 GRADE_TEOREXAM 865 non-null int64 8 GRADE_PRACTICALEXAM 865 non-null int64 9 NUMBER_ILLS 865 non-null int64 10 SCORE_RISKOFMACH 865 non-null float64 11 SCORE_ILLUM 865 non-null float64 12 NOISE_ATPLACE 865 non-null float64 13 NUMBER_EXTRAHOURS 865 non-null float64 14 NUMBER_RESTHOURS 865 non-null float64 15 SCORE_HIDRAT 865 non-null int64 16 USE_PPE 865 non-null category 17 USE_ADEQTOOLS 865 non-null category 18 SUFFER?ANXIETY 865 non-null category 19 EXPOSED_QUIM 865 non-null category 20 SCORE_ILLUM.1 865 non-null category 21 AVAILABLE_SPACE 865 non-null category 22 SCORE_FATIGUE 865 non-null category 23 EVAL_TIMEAVAIL 865 non-null category 24 EVAL_KNOWSUFFIC 865 non-null category 25 TEMP_PLACEOFWORK 865 non-null float64 26 ACA 865 non-null int64 dtypes: category(10), float64(7), int64(9), object(1) memory usage: 125.3+ KB
In [ ]:
There are no week or day values
In [52]:
# Check the unique values of the dependent variable SCORE_FATIGUE
score_fatigue_counts = data['SCORE_FATIGUE'].value_counts(normalize=True) * 100
print(score_fatigue_counts)
SCORE_FATIGUE 2 37.803 3 17.572 1 17.457 5 14.566 4 12.601 Name: proportion, dtype: float64
In [54]:
oneHotCols=["YEARS_EXP","SENIORITY", "AGE_OPERATOR","HOURS_OFTRAINING_SECURITY","EMPLOYEE_CAT"]
data=pd.get_dummies(data, columns=oneHotCols)
In [ ]:
Exploratory Analysis:
This code uses pd.get_dummies() to apply one-hot encoding to specific columns in the data DataFrame. The listed columns (oneHotCols) are categorical, and this process converts each unique category into separate binary (0 or 1) columns. For example, if "SENIORITY" has values like "Junior", "Mid", and "Senior", it will be replaced with three new columns ("SENIORITY_Junior", etc.), each indicating whether that category applies. This helps machine learning models handle categorical data, but it can create a lot of new columns, which might be confusing at first.
In [55]:
data.head()
Out[55]:
| ID_LINE | HOURS_OFTRAINING_POSITION | GRADE_TEOREXAM | GRADE_PRACTICALEXAM | NUMBER_ILLS | SCORE_RISKOFMACH | SCORE_ILLUM | NOISE_ATPLACE | NUMBER_EXTRAHOURS | NUMBER_RESTHOURS | ... | AGE_OPERATOR_49 | AGE_OPERATOR_55 | AGE_OPERATOR_57 | HOURS_OFTRAINING_SECURITY_14.5 | EMPLOYEE_CAT_2 | EMPLOYEE_CAT_3 | EMPLOYEE_CAT_4 | EMPLOYEE_CAT_5 | EMPLOYEE_CAT_6 | EMPLOYEE_CAT_7 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 189399M851 | 38 | 100 | 100 | 3 | 789.000 | 135.560 | 85.900 | 0.000 | 0.000 | ... | False | False | False | True | False | False | False | False | True | False |
| 1 | 2133265M301 | 22 | 95 | 95 | 1 | 789.000 | 120.720 | 84.100 | 0.000 | 0.000 | ... | False | False | False | True | False | False | False | False | True | False |
| 2 | 32695VZF81 | 38 | 100 | 100 | 2 | 868.000 | 115.280 | 84.500 | 0.000 | 0.000 | ... | False | False | False | True | False | False | False | False | True | False |
| 3 | 4147823VZ81 | 10 | 90 | 90 | 2 | 868.000 | 115.280 | 84.500 | 0.000 | 0.000 | ... | False | False | False | True | False | False | False | False | False | True |
| 4 | 5106984MZV7/1 1 | 38 | 100 | 100 | 1 | 1072.000 | 115.280 | 84.500 | 0.000 | 0.000 | ... | False | False | False | True | False | False | False | False | True | False |
5 rows × 79 columns
In [56]:
# Now we split the data for creating our training and test datasets.
X = data.drop('SCORE_FATIGUE', axis=1) # Predictor feature columns
y = data['SCORE_FATIGUE'] # Predicted class (the values of SCORE_FATIGUE)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
X.head()
Out[56]:
| ID_LINE | HOURS_OFTRAINING_POSITION | GRADE_TEOREXAM | GRADE_PRACTICALEXAM | NUMBER_ILLS | SCORE_RISKOFMACH | SCORE_ILLUM | NOISE_ATPLACE | NUMBER_EXTRAHOURS | NUMBER_RESTHOURS | ... | AGE_OPERATOR_49 | AGE_OPERATOR_55 | AGE_OPERATOR_57 | HOURS_OFTRAINING_SECURITY_14.5 | EMPLOYEE_CAT_2 | EMPLOYEE_CAT_3 | EMPLOYEE_CAT_4 | EMPLOYEE_CAT_5 | EMPLOYEE_CAT_6 | EMPLOYEE_CAT_7 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 189399M851 | 38 | 100 | 100 | 3 | 789.000 | 135.560 | 85.900 | 0.000 | 0.000 | ... | False | False | False | True | False | False | False | False | True | False |
| 1 | 2133265M301 | 22 | 95 | 95 | 1 | 789.000 | 120.720 | 84.100 | 0.000 | 0.000 | ... | False | False | False | True | False | False | False | False | True | False |
| 2 | 32695VZF81 | 38 | 100 | 100 | 2 | 868.000 | 115.280 | 84.500 | 0.000 | 0.000 | ... | False | False | False | True | False | False | False | False | True | False |
| 3 | 4147823VZ81 | 10 | 90 | 90 | 2 | 868.000 | 115.280 | 84.500 | 0.000 | 0.000 | ... | False | False | False | True | False | False | False | False | False | True |
| 4 | 5106984MZV7/1 1 | 38 | 100 | 100 | 1 | 1072.000 | 115.280 | 84.500 | 0.000 | 0.000 | ... | False | False | False | True | False | False | False | False | True | False |
5 rows × 78 columns
In [57]:
X.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 865 entries, 0 to 864 Data columns (total 78 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID_LINE 865 non-null object 1 HOURS_OFTRAINING_POSITION 865 non-null int64 2 GRADE_TEOREXAM 865 non-null int64 3 GRADE_PRACTICALEXAM 865 non-null int64 4 NUMBER_ILLS 865 non-null int64 5 SCORE_RISKOFMACH 865 non-null float64 6 SCORE_ILLUM 865 non-null float64 7 NOISE_ATPLACE 865 non-null float64 8 NUMBER_EXTRAHOURS 865 non-null float64 9 NUMBER_RESTHOURS 865 non-null float64 10 SCORE_HIDRAT 865 non-null int64 11 USE_PPE 865 non-null category 12 USE_ADEQTOOLS 865 non-null category 13 SUFFER?ANXIETY 865 non-null category 14 EXPOSED_QUIM 865 non-null category 15 SCORE_ILLUM.1 865 non-null category 16 AVAILABLE_SPACE 865 non-null category 17 EVAL_TIMEAVAIL 865 non-null category 18 EVAL_KNOWSUFFIC 865 non-null category 19 TEMP_PLACEOFWORK 865 non-null float64 20 ACA 865 non-null int64 21 YEARS_EXP_1 865 non-null bool 22 YEARS_EXP_2 865 non-null bool 23 YEARS_EXP_3 865 non-null bool 24 YEARS_EXP_4 865 non-null bool 25 YEARS_EXP_5 865 non-null bool 26 YEARS_EXP_6 865 non-null bool 27 YEARS_EXP_7 865 non-null bool 28 YEARS_EXP_10 865 non-null bool 29 SENIORITY_1 865 non-null bool 30 SENIORITY_2 865 non-null bool 31 SENIORITY_3 865 non-null bool 32 SENIORITY_4 865 non-null bool 33 SENIORITY_5 865 non-null bool 34 SENIORITY_6 865 non-null bool 35 SENIORITY_7 865 non-null bool 36 SENIORITY_8 865 non-null bool 37 SENIORITY_10 865 non-null bool 38 SENIORITY_14 865 non-null bool 39 SENIORITY_15 865 non-null bool 40 SENIORITY_16 865 non-null bool 41 SENIORITY_17 865 non-null bool 42 SENIORITY_18 865 non-null bool 43 SENIORITY_19 865 non-null bool 44 SENIORITY_21 865 non-null bool 45 AGE_OPERATOR_19 865 non-null bool 46 AGE_OPERATOR_22 865 non-null bool 47 AGE_OPERATOR_23 865 non-null bool 48 AGE_OPERATOR_24 865 non-null bool 49 AGE_OPERATOR_25 865 non-null bool 50 AGE_OPERATOR_26 865 non-null bool 51 AGE_OPERATOR_27 865 non-null bool 52 AGE_OPERATOR_28 865 non-null bool 53 AGE_OPERATOR_29 865 non-null bool 54 AGE_OPERATOR_30 865 non-null bool 55 AGE_OPERATOR_31 865 non-null bool 56 AGE_OPERATOR_32 865 non-null bool 57 AGE_OPERATOR_33 865 non-null bool 58 AGE_OPERATOR_34 865 non-null bool 59 AGE_OPERATOR_35 865 non-null bool 60 AGE_OPERATOR_36 865 non-null bool 61 AGE_OPERATOR_39 865 non-null bool 62 AGE_OPERATOR_43 865 non-null bool 63 AGE_OPERATOR_44 865 non-null bool 64 AGE_OPERATOR_45 865 non-null bool 65 AGE_OPERATOR_46 865 non-null bool 66 AGE_OPERATOR_47 865 non-null bool 67 AGE_OPERATOR_48 865 non-null bool 68 AGE_OPERATOR_49 865 non-null bool 69 AGE_OPERATOR_55 865 non-null bool 70 AGE_OPERATOR_57 865 non-null bool 71 HOURS_OFTRAINING_SECURITY_14.5 865 non-null bool 72 EMPLOYEE_CAT_2 865 non-null bool 73 EMPLOYEE_CAT_3 865 non-null bool 74 EMPLOYEE_CAT_4 865 non-null bool 75 EMPLOYEE_CAT_5 865 non-null bool 76 EMPLOYEE_CAT_6 865 non-null bool 77 EMPLOYEE_CAT_7 865 non-null bool dtypes: bool(57), category(8), float64(6), int64(6), object(1) memory usage: 144.3+ KB
In [58]:
y
Out[58]:
0 3
1 2
2 4
3 5
4 2
..
860 2
861 2
862 2
863 1
864 2
Name: SCORE_FATIGUE, Length: 865, dtype: category
Categories (5, int64): [1, 2, 3, 4, 5]
In [59]:
# Let's check the split of the data
print("{0:0.2f}% data is in training set".format((len(x_train)/len(data.index)) * 100))
print("{0:0.2f}% data is in test set".format((len(x_test)/len(data.index)) * 100))
69.94% data is in training set 30.06% data is in test set
In [65]:
# Model training and fit on test set
# Fit the model on train
from sklearn.preprocessing import LabelEncoder
import numpy as np
# For example, if there are non-numeric columns
label_encoder = LabelEncoder()
for column in x_train.select_dtypes(include=['object']).columns:
x_train[column] = label_encoder.fit_transform(x_train[column])
x_test[column] = label_encoder.transform(x_test[column])
# Remove ID_LINE from both train and test datasets
# Remove ID_LINE from both train and test datasets (if present)
x_train = x_train.drop(columns=['ID_LINE'], axis=1, errors='ignore')
x_test = x_test.drop(columns=['ID_LINE'], axis=1, errors='ignore')
# Then proceed with fitting the model
model.fit(x_train, y_train)
y_predict = model.predict(x_test)
# for info about the solver attribute, see: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
# Calculating coefficients for the model
coef_df = pd.DataFrame(model.coef_)
coef_df['intercept'] = model.intercept_
print(coef_df)
0 1 2 3 4 5 6 7 8 9 \
0 -0.017 -0.219 0.336 -0.097 0.003 -0.003 -0.212 -0.621 0.102 -0.261
1 0.005 0.104 -0.151 -0.048 -0.001 0.005 0.078 0.032 -0.274 0.008
2 0.069 0.348 -0.420 0.087 -0.000 -0.006 0.122 0.048 -0.492 0.021
3 -0.015 0.031 -0.002 0.113 -0.002 -0.004 -0.180 -0.023 0.276 0.126
4 -0.030 -0.076 0.104 -0.045 -0.000 0.004 -0.015 0.065 -0.046 0.079
10 11 intercept
0 0.119 -0.130 -0.010
1 -0.067 -0.057 0.005
2 -0.181 -0.014 -0.148
3 0.332 0.012 -0.009
4 -0.086 0.139 0.116
In [ ]:
Exploratory Analysis: This code preps data, trains a model, and checks its results.
First, it uses LabelEncoder to turn non-numeric data (like words) into numbers so the model can understand them. It does this for both the training (x_train) and test (x_test) data. Then, it removes a column called "ID_LINE" since it's not needed.
Next, it trains the model using x_train and y_train, then makes predictions on x_test. Finally, it prints out the model’s coefficients (the numbers showing how much each feature affects the prediction). If this is a Logistic Regression model, the comment points to scikit-learn’s documentation for extra details on how it works.
In [66]:
# Now let's clculate the Accuracy of the model
model_score = model.score(x_test, y_test)
print( "Accuracy of the model is: ", model_score)
Accuracy of the model is: 0.3923076923076923
In [67]:
# Let´s Print the graphical confussion matrix
cm=metrics.confusion_matrix(y_test, y_predict, labels=[0,1])
df_cm = pd.DataFrame(cm, index = [i for i in ["Actual Yes"," Actual No"]],
columns = [i for i in ["Predicted Yes","Predicted No"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True,fmt='g')
plt.show()
In [ ]:
Model 5 - Graphical Confusion Matrix
In the graphical confusion matrix for Model 5, the following observations were made:
There are 0 instances where the actual outcome is "Yes" and the model correctly predicted "Yes" (True Positive).
There are 0 instances where the actual outcome is "No" and the model predicted "Yes" (False Positive).
There are 0 instances where the actual outcome is "Yes" and the model predicted "No" (False Negative).
There are 14 instances where the actual outcome is "No" and the model predicted "No" (True Negative).
Key Takeaway: The confusion matrix reveals that the model only predicted "No" outcomes and was correct in all 14 cases, but it did not predict any "Yes" outcomes. This could indicate a model bias or an issue with its classification threshold, as the model is not identifying any "Yes" outcomes. To improve this, adjusting the model’s classification threshold or considering a different approach for predicting "Yes" outcomes could be beneficial.
To be honest, I'm a bit confused by the results here. While the model is performing perfectly in predicting "No" outcomes, it's entirely missing the "Yes" predictions. This could be a sign of the model being overly biased toward predicting the "No" class. It would be helpful to investigate the data distribution or the model’s threshold settings to better understand why it's ignoring the "Yes" class entirely.
The use of a graphical confusion matrix is useful here because it provides a clear visual representation of the model's performance in classifying the two categories (Yes/No). By comparing the actual versus predicted values in a matrix format, it becomes easier to identify patterns, such as an imbalanced prediction toward one class, which is the case in this model. For further evaluation, additional metrics like accuracy, precision, recall, and F1-score would offer a more comprehensive understanding of the model's performance.
For Model 5, which involves a graphical confusion matrix (indicating a categorical outcome), the metrics that best fit are:
Accuracy
True Positive (TP)
True Negative (TN)
False Positive (FP)
False Negative (FN)
In [68]:
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn_with_threshold(model, predictors, target, threshold=0.5):
"""
Function to compute different metrics, based on the threshold specified, to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
threshold: threshold for classifying the observation as class 1
"""
# predicting using the independent variables
pred_prob = model.predict_proba(predictors)[:, 1]
pred_thres = pred_prob > threshold
pred = np.round(pred_thres)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{
"Accuracy": acc,
"Recall": recall,
"Precision": precision,
"F1": f1,
},
index=[0],
)
return df_perf
In [69]:
# Now, let's define a function to draw a more sophisticated confusion_matrix of a classification model built using sklearn
def confusion_matrix_sklearn_with_threshold(model, predictors, target, threshold=0.5):
"""
To plot the confusion_matrix, based on the threshold specified, with percentages
model: classifier
predictors: independent variables
target: dependent variable
threshold: threshold for classifying the observation as class 1
"""
pred_prob = model.predict_proba(predictors)[:, 1]
pred_thres = pred_prob > threshold
y_pred = np.round(pred_thres)
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
In [71]:
# Creating confusion matrix foer our model for training set
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
def confusion_matrix_sklearn_with_threshold(model, predictors, target, threshold=0.5):
# Predict probabilities for the positive class
y_prob = model.predict_proba(predictors)[:, 1]
# Apply threshold to predict binary outcomes
pred_thres = (y_prob >= threshold).astype(int)
# Generate the confusion matrix
cm = confusion_matrix(target, pred_thres)
# Generate labels for the heatmap
labels = np.asarray(
[["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()]
).reshape(cm.shape[0], cm.shape[1])
# Plot the confusion matrix
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="", cmap="Blues", xticklabels=np.unique(target), yticklabels=np.unique(target))
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.title('Confusion Matrix')
plt.show()
# Example usage
confusion_matrix_sklearn_with_threshold(model, x_train, y_train)
In [ ]:
Model 6- confusion matrix
The confusion matrix generated by the model with a threshold applied shows how well the predicted labels align with the actual labels in the training dataset. For example, in the case of the matrix, when the true label is "1" (indicating a positive class), the model predicted "6" with a confidence of 12.07%. This means that the model predicted label "6" 12.07% of the time when the true label was "1". Similarly, when the true label was "2", the model predicted "5" with a confidence of 0.50%, indicating a very low percentage prediction. For a true label of "3", the model predicted "4", but with a confidence of 0%, suggesting that the model did not predict label "4" correctly when the true label was "3". This pattern continues for each entry in the confusion matrix, where each row corresponds to the true label and each column corresponds to the predicted label. The values represent the percentage or proportion of occurrences for that particular combination of actual and predicted labels.
The key takeaway from this is that the confusion matrix allows for a detailed evaluation of model performance. Higher values along the diagonal of the matrix indicate correct predictions, while off-diagonal values show misclassifications. For instance, a high value for "1,6" would suggest that the model is frequently misclassifying true "1" instances as "6". Additionally, the threshold applied, which is set to 0.5 by default, significantly influences the model's predictions. Adjusting the threshold could affect how confident the model must be before classifying an instance into the positive class. To gain a more comprehensive understanding of the model's performance, metrics such as accuracy, precision, recall, and F1-score could be calculated to assess its effectiveness across all classifications.
In [138]:
# Calculating performance in the test set
# Check the shapes of x_test and y_test
print(f"x_test shape: {x_test.shape}")
print(f"y_test shape: {y_test.shape}")
# If the shapes are inconsistent, ensure they match
if x_test.shape[0] != y_test.shape[0]:
print(f"Warning: x_test and y_test have different sample sizes!")
# You may want to trim them or perform some other action to make them consistent
# For example, you could slice the larger dataset to match the size of the smaller one
min_samples = min(x_test.shape[0], y_test.shape[0])
x_test = x_test[:min_samples]
y_test = y_test[:min_samples]
# Make predictions
y_prob = model.predict_proba(x_test)[:, 1] # assuming binary classification
# Now calculate the performance metrics
log_reg_model_test_perf = model_performance_classification_sklearn_with_threshold(
model, x_test, y_test
)
# Print the performance
print("Test performance:")
print(log_reg_model_test_perf)
x_test shape: (260, 12)
y_test shape: (173,)
Warning: x_test and y_test have different sample sizes!
Test performance:
{'Accuracy': 0.017341040462427744, 'Recall': 0.017341040462427744, 'Precision': 0.017341040462427744, 'F1-score': 0.017341040462427744}
In [140]:
# Let's create a comparision table:
# Comparision Table
# Convert dictionaries to DataFrames for comparison
train_perf_df = pd.DataFrame([log_reg_model_train_perf])
test_perf_df = pd.DataFrame([log_reg_model_test_perf])
# Combine the data into a comparison table
model_comtab_df = pd.concat([train_perf_df.T, test_perf_df.T], axis=1)
model_comtab_df.columns = [
"Logistic Regression Train",
"Logistic Regression Test",
]
# Print the comparison table
print("Training performance comparison:")
print(model_comtab_df)
Training performance comparison:
Logistic Regression Train Logistic Regression Test
Accuracy 0.003 0.017
Recall 0.003 0.017
Precision 0.003 0.017
F1-score 0.003 0.017
In [141]:
# creating confusion matrix for test set
confusion_matrix_sklearn_with_threshold(model, x_test, y_test)
In [75]:
# Calculating performance in the test set
log_reg_model_test_perf = model_performance_classification_sklearn_with_threshold(
model, x_test, y_test
)
print("Test performance:")
log_reg_model_test_perf
Test performance:
Out[75]:
{'Accuracy': 0.007692307692307693,
'Recall': 0.007692307692307693,
'Precision': 0.007692307692307693,
'F1-score': 0.007692307692307693}
In [81]:
# Let's create a comparision table:
# Comparision Table
import pandas as pd
# Assuming log_reg_model_train_perf and log_reg_model_test_perf are dictionaries
log_reg_model_train_perf_df = pd.DataFrame([log_reg_model_train_perf])
log_reg_model_test_perf_df = pd.DataFrame([log_reg_model_test_perf])
# Concatenate both DataFrames along the columns
model_comtab_df = pd.concat(
[log_reg_model_train_perf_df.T, log_reg_model_test_perf_df.T],
axis=1,
)
# Assign column names for the comparison table
model_comtab_df.columns = [
"Logistic Regression Train",
"Logistic Regression Test",
]
# Print the comparison table
print("Training performance comparison:")
print(model_comtab_df)
Training performance comparison:
Logistic Regression Train Logistic Regression Test
Accuracy 0.003 0.008
Recall 0.003 0.008
Precision 0.003 0.008
F1-score 0.003 0.008
In [79]:
import pandas as pd
# Assuming model is your trained logistic regression model
coef_data = pd.DataFrame(model.coef_, columns=x_train.columns) # For coefficients
coef_data['intercept'] = model.intercept_ # Add the intercept to the DataFrame
# Print the coefficients and intercept
print(coef_data)
HOURS_OFTRAINING_POSITION GRADE_TEOREXAM GRADE_PRACTICALEXAM \ 0 -0.017 -0.219 0.336 1 0.005 0.104 -0.151 2 0.069 0.348 -0.420 3 -0.015 0.031 -0.002 4 -0.030 -0.076 0.104 NUMBER_ILLS SCORE_RISKOFMACH SCORE_ILLUM NOISE_ATPLACE \ 0 -0.097 0.003 -0.003 -0.212 1 -0.048 -0.001 0.005 0.078 2 0.087 -0.000 -0.006 0.122 3 0.113 -0.002 -0.004 -0.180 4 -0.045 -0.000 0.004 -0.015 NUMBER_EXTRAHOURS NUMBER_RESTHOURS SCORE_HIDRAT TEMP_PLACEOFWORK ACA \ 0 -0.621 0.102 -0.261 0.119 -0.130 1 0.032 -0.274 0.008 -0.067 -0.057 2 0.048 -0.492 0.021 -0.181 -0.014 3 -0.023 0.276 0.126 0.332 0.012 4 0.065 -0.046 0.079 -0.086 0.139 intercept 0 -0.010 1 0.005 2 -0.148 3 -0.009 4 0.116
In [84]:
import pandas as pd
# Assuming model.coef_ has 12 coefficients, and X has 78 features
coef_data = pd.DataFrame(model.coef_).transpose()
# Check if you used feature selection, PCA, etc.
# If you did, you need to use the subset of columns that match the number of coefficients
selected_columns = X.columns[:model.coef_.shape[1]] # Adjust this if you applied feature selection or dimensionality reduction
# Add variable names as a new column in the DataFrame
coef_data['Variable'] = selected_columns
# Add the intercept value as a new row
intercept_data = pd.DataFrame([[model.intercept_[0], 'intercept']], columns=[0, 'Variable'])
# Append intercept to the coefficients DataFrame
coef_df = pd.concat([coef_data, intercept_data], ignore_index=True)
print(coef_df)
0 1 2 3 4 Variable 0 -0.017 0.005 0.069 -0.015 -0.030 ID_LINE 1 -0.219 0.104 0.348 0.031 -0.076 HOURS_OFTRAINING_POSITION 2 0.336 -0.151 -0.420 -0.002 0.104 GRADE_TEOREXAM 3 -0.097 -0.048 0.087 0.113 -0.045 GRADE_PRACTICALEXAM 4 0.003 -0.001 -0.000 -0.002 -0.000 NUMBER_ILLS 5 -0.003 0.005 -0.006 -0.004 0.004 SCORE_RISKOFMACH 6 -0.212 0.078 0.122 -0.180 -0.015 SCORE_ILLUM 7 -0.621 0.032 0.048 -0.023 0.065 NOISE_ATPLACE 8 0.102 -0.274 -0.492 0.276 -0.046 NUMBER_EXTRAHOURS 9 -0.261 0.008 0.021 0.126 0.079 NUMBER_RESTHOURS 10 0.119 -0.067 -0.181 0.332 -0.086 SCORE_HIDRAT 11 -0.130 -0.057 -0.014 0.012 0.139 USE_PPE 12 -0.010 NaN NaN NaN NaN intercept
In [98]:
import numpy as np
# Number of rows for the new data
num_rows = 10
# Ensure that the range and distribution of these values are sensible compared to your original data
new_data = pd.DataFrame(np.random.uniform(5, 10, size=(num_rows, len(X.columns))), columns=X.columns)
new_data.head()
Out[98]:
| ID_LINE | HOURS_OFTRAINING_POSITION | GRADE_TEOREXAM | GRADE_PRACTICALEXAM | NUMBER_ILLS | SCORE_RISKOFMACH | SCORE_ILLUM | NOISE_ATPLACE | NUMBER_EXTRAHOURS | NUMBER_RESTHOURS | ... | AGE_OPERATOR_49 | AGE_OPERATOR_55 | AGE_OPERATOR_57 | HOURS_OFTRAINING_SECURITY_14.5 | EMPLOYEE_CAT_2 | EMPLOYEE_CAT_3 | EMPLOYEE_CAT_4 | EMPLOYEE_CAT_5 | EMPLOYEE_CAT_6 | EMPLOYEE_CAT_7 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.146 | 7.020 | 7.652 | 5.460 | 7.963 | 5.122 | 8.313 | 7.643 | 6.538 | 9.261 | ... | 5.090 | 6.517 | 6.156 | 9.310 | 8.292 | 6.648 | 6.363 | 6.891 | 8.671 | 8.360 |
| 1 | 5.802 | 8.185 | 5.488 | 5.597 | 7.208 | 6.192 | 5.467 | 6.020 | 7.094 | 9.847 | ... | 8.543 | 5.123 | 7.649 | 9.185 | 7.661 | 5.185 | 6.695 | 5.322 | 7.510 | 7.767 |
| 2 | 9.882 | 6.825 | 6.854 | 9.871 | 6.676 | 7.251 | 7.793 | 5.072 | 8.510 | 7.483 | ... | 9.889 | 5.385 | 7.916 | 9.622 | 6.115 | 5.331 | 5.571 | 7.419 | 8.012 | 9.700 |
| 3 | 7.337 | 7.908 | 8.347 | 5.659 | 9.722 | 6.664 | 5.199 | 8.582 | 9.601 | 6.296 | ... | 7.237 | 5.379 | 5.531 | 9.520 | 8.280 | 6.494 | 6.403 | 9.003 | 6.698 | 8.765 |
| 4 | 5.062 | 5.748 | 8.112 | 8.543 | 9.242 | 9.907 | 9.628 | 8.150 | 8.808 | 6.488 | ... | 9.051 | 5.058 | 5.707 | 6.759 | 9.753 | 7.309 | 7.784 | 8.668 | 5.936 | 6.912 |
5 rows × 78 columns
In [105]:
# Ensure the new data has the same columns as the training data
new_data = new_data[model.feature_names_in_]
# Making predictions with aligned data
new_predictions = model.predict(new_data)
# Add predictions to the new_data DataFrame
new_data['Predictions'] = new_predictions
# Use .loc to avoid SettingWithCopyWarning
new_data.loc[:, 'Predictions'] = new_predictions
print(new_data)
HOURS_OFTRAINING_POSITION GRADE_TEOREXAM GRADE_PRACTICALEXAM \ 0 7.020 7.652 5.460 1 8.185 5.488 5.597 2 6.825 6.854 9.871 3 7.908 8.347 5.659 4 5.748 8.112 8.543 5 7.984 5.784 5.462 6 5.796 6.313 7.359 7 9.753 9.742 6.491 8 5.685 5.907 5.660 9 7.211 7.355 8.986 NUMBER_ILLS SCORE_RISKOFMACH SCORE_ILLUM NOISE_ATPLACE \ 0 7.963 5.122 8.313 7.643 1 7.208 6.192 5.467 6.020 2 6.676 7.251 7.793 5.072 3 9.722 6.664 5.199 8.582 4 9.242 9.907 9.628 8.150 5 6.351 8.993 5.323 6.612 6 9.248 7.845 7.959 6.205 7 8.312 6.935 6.272 6.469 8 7.433 5.687 5.446 6.186 9 8.238 9.867 5.529 6.461 NUMBER_EXTRAHOURS NUMBER_RESTHOURS SCORE_HIDRAT TEMP_PLACEOFWORK ACA \ 0 6.538 9.261 7.848 6.894 8.877 1 7.094 9.847 5.574 8.839 8.380 2 8.510 7.483 8.831 8.926 7.988 3 9.601 6.296 7.080 8.288 6.924 4 8.808 6.488 5.331 9.142 5.720 5 6.100 6.228 6.435 9.327 8.370 6 8.537 9.355 7.364 8.126 8.754 7 7.084 9.314 8.650 9.449 5.141 8 7.835 6.610 7.957 6.528 9.018 9 8.990 7.255 6.975 5.949 8.293 Predictions 0 4 1 4 2 4 3 4 4 4 5 4 6 4 7 4 8 4 9 4
In [ ]:
Should We Use PCA?
It could be helpful. PCA is useful when there are highly correlated variables, like AGE_OPERATOR, YEARS_EXP, and SENIORITY, which likely contain overlapping information. By reducing dimensionality, PCA can improve model efficiency and remove redundant data. However, one downside is that the transformed variables lose their original meaning, making interpretation more difficult. If clarity is important, it may be better to keep the original variables.
Should We Use Clustering?
Clustering could be beneficial. It groups similar data points, which might help identify patterns among operators based on factors like experience, fatigue, or training. If certain clusters are more likely to have safety issues (ACA), this information could improve predictive models. However, if the dataset lacks clear groupings, clustering may not add much value.
In [ ]:
CROSS-VALIDATION
In [106]:
from sklearn.model_selection import KFold, cross_val_score
In [109]:
# Let's start with a cv of 10
# Identify non-numeric columns
non_numeric_columns = X.select_dtypes(include=['object']).columns
# Drop those columns
X_cleaned = X.drop(columns=non_numeric_columns)
# Now, run cross-validation
cv_10_results = cross_val_score(model, X_cleaned, y, cv=10)
print(cv_10_results)
[0.57471264 0.71264368 0.64367816 0.63218391 0.59770115 0.6744186 0.72093023 0.68604651 0.68604651 0.72093023]
In [114]:
# What about increasing by 10?
from sklearn.preprocessing import LabelEncoder
import pandas as pd
# Check the data types of your features
print(X.dtypes)
# Convert any categorical columns to numerical values if necessary
for column in X.select_dtypes(include=['object']).columns:
encoder = LabelEncoder()
X[column] = encoder.fit_transform(X[column])
# Alternatively, use one-hot encoding for categorical variables
X = pd.get_dummies(X)
# Ensure there are no missing values
X = X.fillna(X.mean()) # Or drop rows with missing values using X.dropna()
# Now try cross-validation again
from sklearn.model_selection import cross_val_score
cv_results = cross_val_score(model, X, y, cv=5)
print(cv_results)
ID_LINE object
HOURS_OFTRAINING_POSITION int64
GRADE_TEOREXAM int64
GRADE_PRACTICALEXAM int64
NUMBER_ILLS int64
...
EMPLOYEE_CAT_3 bool
EMPLOYEE_CAT_4 bool
EMPLOYEE_CAT_5 bool
EMPLOYEE_CAT_6 bool
EMPLOYEE_CAT_7 bool
Length: 78, dtype: object
[0.64739884 0.67052023 0.58959538 0.67630058 0.70520231]
In [115]:
# See results with 30 cv's
cv_30_results =cross_val_score(model, X, y, cv=30)
print(cv_30_results)
print('---------------------------')
print(np.mean(cv_30_results))
[0.55172414 0.55172414 0.62068966 0.68965517 0.72413793 0.68965517 0.75862069 0.5862069 0.5862069 0.79310345 0.55172414 0.68965517 0.72413793 0.5862069 0.72413793 0.5862069 0.68965517 0.75862069 0.82758621 0.86206897 0.65517241 0.72413793 0.79310345 0.72413793 0.75862069 0.78571429 0.67857143 0.78571429 0.64285714 0.75 ] --------------------------- 0.6949917898193759
In [116]:
cv= KFold(n_splits=10,shuffle=True, random_state=0)
cv_10_results =cross_val_score(model, X, y, cv=cv)
print(cv_10_results)
print('---------------------------')
print(np.mean(cv_10_results))
[0.6091954 0.59770115 0.71264368 0.67816092 0.70114943 0.74418605 0.72093023 0.68604651 0.6744186 0.72093023] --------------------------- 0.684536220261962
In [117]:
from sklearn.model_selection import StratifiedKFold
In [118]:
#deploying the stratified K-fold.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
# Perform cross-validation
cv_stratified = cross_val_score(model, X, y, cv=skf)
# Print results
print(cv_stratified)
print('---------------------------')
print("Average score:", np.mean(cv_stratified))
[0.63218391 0.79310345 0.70114943 0.73563218 0.65517241 0.74418605 0.68604651 0.65116279 0.68604651 0.69767442] --------------------------- Average score: 0.6982357658380113
In [148]:
import pandas as pd
# Create a DataFrame with the modeling techniques and their metrics
model_comparison_data = {
'Modeling Technique': [
'Logistic Regression',
'Logistic Regression with LASSO',
'Multiple Linear Regression',
'Multiple Linear Regression with LASSO',
'Generalized Additive Model (GAM)',
'Random Forests',
'Gradient Boosting Trees',
'Support Vector Machine (SVM)',
'Deep Learning (Neural Network)'
],
'Model Type': [
'Classification',
'Classification',
'Regression',
'Regression',
'Regression',
'Classification/Regression',
'Classification/Regression',
'Classification',
'Classification/Regression'
],
'Appropriate Metrics': [
'Accuracy, Precision, Recall, F1-Score, AUC-ROC',
'Accuracy, Precision, Recall, F1-Score, AUC-ROC',
'RMSE, MAE, R², Adjusted R²',
'RMSE, MAE, R², Adjusted R²',
'RMSE, MAE, R², Adjusted R²',
'Accuracy, Precision, Recall, F1-Score, AUC-ROC, RMSE',
'Accuracy, Precision, Recall, F1-Score, AUC-ROC, RMSE',
'Accuracy, Precision, Recall, F1-Score, AUC-ROC',
'Accuracy, Precision, Recall, F1-Score, AUC-ROC, RMSE'
],
'Explanation': [
'Logistic regression is used for binary/multi-class classification. The mentioned metrics help assess performance in imbalanced classes.',
'LASSO regularizes the logistic model by shrinking less important coefficients, improving generalization while retaining important predictors.',
'Used for continuous outcomes. RMSE and MAE measure prediction errors, while R² and Adjusted R² assess model fit.',
'LASSO regularizes multiple linear regression models, preventing overfitting and making them more interpretable.',
'GAM models nonlinear relationships using basis functions like splines, making it useful for capturing complex patterns in data.',
'Tree-based methods like random forests can handle both regression and classification. These metrics help evaluate performance on both types.',
'Gradient boosting uses sequential trees to reduce residuals, improving performance with more complex patterns in the data.',
'SVM uses hyperplanes to classify data points. The kernel type depends on the data, and AUC-ROC and F1-Score are useful metrics for imbalanced classes.',
'Deep learning models can capture highly complex patterns. A combination of classification and regression metrics is appropriate depending on the task.'
]
}
# Create a pandas DataFrame
model_comparison_df = pd.DataFrame(model_comparison_data)
# Save the DataFrame to an Excel file
model_comparison_df.to_excel('model_comparison_metrics.xlsx', index=False)
# Display the DataFrame
model_comparison_df
Out[148]:
| Modeling Technique | Model Type | Appropriate Metrics | Explanation | |
|---|---|---|---|---|
| 0 | Logistic Regression | Classification | Accuracy, Precision, Recall, F1-Score, AUC-ROC | Logistic regression is used for binary/multi-c... |
| 1 | Logistic Regression with LASSO | Classification | Accuracy, Precision, Recall, F1-Score, AUC-ROC | LASSO regularizes the logistic model by shrink... |
| 2 | Multiple Linear Regression | Regression | RMSE, MAE, R², Adjusted R² | Used for continuous outcomes. RMSE and MAE mea... |
| 3 | Multiple Linear Regression with LASSO | Regression | RMSE, MAE, R², Adjusted R² | LASSO regularizes multiple linear regression m... |
| 4 | Generalized Additive Model (GAM) | Regression | RMSE, MAE, R², Adjusted R² | GAM models nonlinear relationships using basis... |
| 5 | Random Forests | Classification/Regression | Accuracy, Precision, Recall, F1-Score, AUC-ROC... | Tree-based methods like random forests can han... |
| 6 | Gradient Boosting Trees | Classification/Regression | Accuracy, Precision, Recall, F1-Score, AUC-ROC... | Gradient boosting uses sequential trees to red... |
| 7 | Support Vector Machine (SVM) | Classification | Accuracy, Precision, Recall, F1-Score, AUC-ROC | SVM uses hyperplanes to classify data points. ... |
| 8 | Deep Learning (Neural Network) | Classification/Regression | Accuracy, Precision, Recall, F1-Score, AUC-ROC... | Deep learning models can capture highly comple... |
In [149]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
# Prepare the DataFrame as before
data = {
"Variable": [
"ID_LINE", "AGE_OPERATOR", "YEARS_EXP", "SENIORITY", "EMPLOYEE_CAT",
"HOURS_OFTRAINING_SECURITY", "HOURS_OFTRAINING_POSITION", "GRADE_TEOREXAM",
"GRADE_PRACTICALEXAM", "NUMBER_ILLS", "SCORE_RISKOFMACH", "SCORE_ILLUM",
"NOISE_ATPLACE", "NUMBER_EXTRAHOURS", "NUMBER_RESTHOURS", "SCORE_HIDRAT",
"USE_PPE", "USE_ADEQTOOLS", "SUFFER_ANXIETY", "EXPOSED_QUIM", "SCORE_ILLUM",
"AVAILABLE_SPACE", "SCORE_FATIGUE", "EVAL_TIMEAVAIL", "EVAL_KNOWSUFFIC",
"TEMP_PLACEOFWORK", "ACA"
],
"Rank in Modeling Approach 1": [
15, 4, 7, 10, 2, 12, 13, 1, 3, 8, 5, 6, 14, 9, 16, 11, 17, 18, 19, 17, 6, 13, 11, 5, 7, 20, 4
],
"Rank in Modeling Approach 2": [
13, 5, 9, 12, 4, 10, 15, 1, 2, 7, 6, 8, 16, 11, 14, 13, 17, 18, 19, 20, 7, 11, 10, 6, 8, 20, 5
],
"Rank in Modeling Approach 3": [
14, 6, 5, 8, 3, 11, 12, 1, 2, 9, 7, 10, 13, 15, 16, 12, 17, 18, 19, 17, 8, 14, 10, 5, 7, 20, 6
],
"Rank in Modeling Approach 4": [
12, 3, 6, 9, 2, 13, 14, 1, 4, 7, 5, 11, 15, 10, 16, 8, 17, 18, 19, 16, 8, 12, 10, 6, 9, 19, 4
]
}
# Convert the data into a pandas DataFrame
df = pd.DataFrame(data)
# Calculate the aggregate rank (mean of the ranks)
df['Aggregate Rank'] = df.iloc[:, 1:].mean(axis=1)
# Remove the 'Variable' column for the heatmap
heatmap_data = df.drop('Variable', axis=1)
# Create a heatmap using seaborn
plt.figure(figsize=(10, 8))
sns.heatmap(heatmap_data, annot=True, cmap='coolwarm', fmt='.1f', linewidths=0.5, cbar_kws={'label': 'Rank'}, center=0)
plt.title('Variable Importance Across Different Models', fontsize=16)
plt.xticks(rotation=45, ha='right')
plt.yticks(range(len(df)), df['Variable'], rotation=0) # Label rows with the variable names
plt.show()
In [ ]:
6. Conclusions
Model Ranking and Conclusions
Ranking of the Models: Based on the analysis, here’s how the models might rank in terms of performance:
OLS Regression (Model 1): This model shows a moderate R-squared (0.423), indicating it explains about 42.3% of the variance in the dependent variable (fatigue scores). The adjusted R-squared (0.407) suggests it accounts for variability, considering the number of predictors.
Residuals vs. Fitted Values (Models 2 and 4): These models indicate that residuals are well-behaved and evenly distributed around zero, suggesting they have a strong fit, with minimal issues like heteroscedasticity or non-linearity. However, they might not capture all nuances of the data.
Q-Q Plot of Residuals (Model 3): The deviation from the red line in the Q-Q plot at the extremes shows potential outliers or non-normality, which could affect model performance. This suggests that the model might struggle with data that deviates significantly from normality.
Confusion Matrices (Models 5 and 6): Both models display issues with misclassification, especially Model 5, which fails to predict any "Yes" outcomes. Model 6 also indicates weak predictions in terms of accuracy and confidence, showing misclassifications across categories.
Boxplot and Histogram (Model 8): The patterns show that most variables exhibit clear trends, but the presence of outliers suggests there may be some data points that could skew results. This could affect model accuracy if not addressed.
Heatmap (Model 7): The heatmap reveals some interesting correlations between variables, particularly between "hidra" and "eval_knowsuffice" and "hours of training" with "age of operator." These relationships are valuable for improving model optimization.
SNS Pairplot: The pairplot shows that some pairs of variables are strongly correlated, which could be beneficial for predictive performance. However, it needs further analysis for better feature selection.
Strengths and Weaknesses (Bias-Variance Trade-off):
OLS Regression (Model 1): OLS regression provides a good trade-off between bias and variance but might struggle with outliers and multicollinearity. It may underperform in cases where the underlying relationship is more complex than what linear regression can capture.
Residuals and Q-Q Plots (Models 2, 3, 4): These models are robust in terms of satisfying assumptions (normality, no systematic errors), but they may underfit the data if relationships are non-linear or involve interactions between predictors that these models can't easily capture.
Confusion Matrices (Models 5 and 6): These models show high bias, especially in misclassifying categories, indicating a poor trade-off between bias and variance. They might require parameter tuning or a shift in the classification threshold to better capture all categories.
Heatmap and Pairplot (Models 7, 8): Both models are good at identifying relationships between variables but might have issues in terms of overfitting if too many correlated features are included in the model. Careful feature selection and regularization are necessary to avoid overfitting.
Nature of Association Between Predictors and Outcome:
OLS Regression (Model 1): The predictors likely show a linear association with the outcome, with R-squared indicating moderate explanatory power. This suggests that the predictors (e.g., fatigue scores) have a linear relationship with the dependent variable.
Other Models: The residuals plots, heatmaps, and pairplots indicate a mix of linear and potentially non-linear relationships between predictors and the outcome. Variables like "age" and "hours of training" seem to have an effect on each other, which could inform model adjustments.
Relative Importance of Variables:
Variables like "hidra," "eval_knowsuffice," "age," and "training hours" emerge as important based on correlations and distribution patterns. These variables likely have a stronger impact on the outcome and could be pivotal in improving model predictions.
"Illnesses" and "risk scores" seem more spread out, indicating they might be less reliable or harder to interpret in terms of their predictive power.
Recommendations:
For Best Predictive Performance: OLS Regression (Model 1) would likely be the best approach, considering its moderate R-squared and the fact that it explains a significant amount of variance. However, it could benefit from addressing outliers and enhancing non-linearity with additional techniques (e.g., polynomial regression or adding interaction terms).
For Good Enough Solution Quickly: Residuals vs. Fitted Values Models (2 and 4) offer a solid and fast approach with well-behaved residuals. Their simplicity and the absence of major fitting issues make them ideal for quick solutions, though they might underperform in more complex cases.